ebook converter

Last modified: Mon Feb 19 12:06:37 2007

Version 0.1a, Copyright ©2007 Kevin Boone
Distributed under the terms of the GNU Public Licence

What is this?

ebookconverter is a simple utility for converting ebook documents in HTML, text, and PDF formats from one of these formats to another. By `ebook documents' I mean long documents whose main component is text, although simple formatting is supported. Another characteristic of ebooks, as documents, is that a presentation that is easy to read, and makes effective use of the screen or display area, is more important than precise layout.

In particular, this software is intended to facilitate the reading of ebooks on hand-held computers and media players, which usually support only a very limited range of document formats, and have small screens (small, that is, compared to printed paper). For example, the Archos x04 range (404, 504, 604) of portable media players have small (2"-3") screens, and can read only PDF format. While it is, of course, possible to convert documents to these formats using modern word processors, word processors are not usually aware of the constraints of small screens. In particular, they do not usually re-flow text to use all the available screen area. This program does. What's more, because this program is entirely driven by command-line operation, it can easily be embedded in scripts or batch files, and used to do bulk conversion -- something that is not practicable with a word processor.

ebookconverter is written entirely in Java, and should run on most modern computer platforms. It has mostly been tested under Linux. It is intended to run under Sun's JDK 1.5 Java run-time engine, or something very similar. ebookconverter makes use of a substantial number of open-source components from the Apache project and elsewhere -- see below for legal and copying information.

HTML support in this program is fairly rudimentary. Only basic formatting tags are recognized. As with plain text, the program works on the assumption that the document is to be read on a small screen, and adjusts the layout accordingly. Please remember that the main goal of this program is to get something readable on small screens. If you're looking for maximum fidelity between the source and destination document formats, this program is not for you.

Getting the software

The source and binary package can be downloaded from here. If you unzip this archive you will find, among other things, a file ebookconverter.jar, which is the only file you need to run the application.

Prerequisites

To use ebookconverter you need JDK1.5.0 or later, or something compatible, and the java run-time engine needs to be runnable from the command line. That is, if you do:
% java
on a Unix system or
c:\> java
on Windows you should get a Java usage message, and not an error message of some kind. If you need to enter the full path to java that's OK, but it will be less convenient.

I have done most testing with JDK 1.5.0_04. I would expect anything later than this to work, but I can't promise. I haven't tested with any Java implementation but Sun's, and I don't propose to. It may work; it may not. This program will very likely not work with JDK 1.4.x JVMs, even if you recompile it to suit, because of the way the embedded XML parser has changed between 1.4.x and 1.5.0.

PDF generation is a memory-intensive process. Memory requirements vary according to the size of the document but, for very long documents (more than a few hundred pages), you may find that up to a gigabyte of memory is required. Of course, this need not be real, physical RAM; but if most of it isn't, you can expect a certain amount of hard disk thrashing.

Installation

The entire application is contained in a single file: ebookconverter.jar. Please copy this file to any convenient place. I would suggest /usr/lib on Unix systems, and \windows\system32\ on Windows. The choice of directory is entirely arbitrary, but if the name is quick to type, that will probably be more convenient. I will assume in what follows that you have installed in /usr/lib.

The archive is supplied with a number of predefined formatting profiles to suit the Archos x04 personal media players. These are in the properties directory. If you wish to use them, they can be copied to any convenient place.

Basic usage

ebookconverter has a great many command-line options for fine-tuning its operation. These will all be described in detail below. However, the simplest mode of operation is:
java -jar /usr/lib/ebookconverter.jar [input_file] [output_file] 
You will need to replace /usr/lib with whatever directory you installed in. input_file is the name of an existing text, HTML, PDF, or XSL-FO file. output_file is the name of a text, HTML, or PDF file which may, or may not, exist. If it exists, it will be over-written without warning.

In this mode of operation, the program infers the document types from the filename extensions. Unless told otherwise, it assumes that HTML files have names that end in .htm or .html, and text files have names that end in .txt or .text. You can override this assumption if necessary (see below for details).
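The extension-based guess can be pictured with a few lines of Java. This is only a sketch, not the program's own code: the method name guessType is hypothetical, and the .pdf mapping and the case-insensitive comparison are assumptions (only the HTML and text extensions are stated above).

```java
import java.util.Locale;

class TypeGuess {
    // Hypothetical helper: maps a filename to a document type by its extension.
    // The .htm/.html and .txt/.text defaults come from the text above;
    // the .pdf mapping and the case folding are assumptions.
    static String guessType(String filename) {
        String name = filename.toLowerCase(Locale.ROOT);
        if (name.endsWith(".htm") || name.endsWith(".html")) return "html";
        if (name.endsWith(".txt") || name.endsWith(".text")) return "text";
        if (name.endsWith(".pdf")) return "pdf";
        return "unknown";
    }

    public static void main(String[] args) {
        System.out.println(guessType("book.HTML"));
        System.out.println(guessType("notes.text"));
    }
}
```

A file whose name does not follow these conventions would fall through to "unknown" here, which is the situation in which the type has to be given explicitly.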

Unless told otherwise, the program will produce output suitable for printing on A4 paper. For any other document or screen size, you will need to modify the operation with command-line switches or profile files.

Conversions supported

ebookconverter converts between text, HTML, and PDF files. It can also convert from, but not yet to, XSL-FO files. XSL-FO is a relatively new standard for presenting formatted text. Not all these conversions produce staggeringly impressive results, for reasons that will be described below. ebookconverter can also convert files into different instances of the same format. Why? Consider the Archos 604 media player, for example. It can read only PDF documents, and has a screen about 150mm wide by 100mm high. A PDF document formatted for, say, A4 paper will be readable on this device, but not very well. This is because if you scale the document so it fits on the screen, the text will be too small to read. And if you don't, you'll have to pan around the document to get from the beginning of a text line to the end. You'll be able to read it, but not comfortably, and certainly not quickly.

However, if you convert the original PDF to a new PDF using ebookconverter, and set the output document size to 150mm x 100mm, you'll get a document whose text fills the whole screen when viewed on the Archos device, and which fits the screen exactly -- no zooming or panning required. Such a document will be much more comfortable to read.

Of course, you'll lose all the layout and most of the formatting information in the original file. There are both technical and pragmatic reasons for this. The technical reasons are discussed later in this document; the main pragmatic reason is that it is essentially the layout information in the original file that stopped it fitting the screen in the first place.

Another example: some ebook viewers for handheld devices expect plain text files to be formatted to the correct line length to fit the screen. This program can take a text file which is formatted to, say, 80 columns per line, and convert into a document with 30 columns per line, to suit a smaller display.
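To make the idea of re-flowing concrete, here is a minimal greedy word-wrap in Java. It is an illustration under simple assumptions (one paragraph, words separated by whitespace, no hyphenation), and the class and method names are hypothetical; the program's real algorithm does considerably more.

```java
import java.util.ArrayList;
import java.util.List;

class Reflow {
    // Greedy wrap: discard the original line breaks, then start a new line
    // whenever the next word would push the line past 'width' columns.
    // A single word longer than 'width' is left on a line of its own.
    static List<String> reflow(String text, int width) {
        List<String> lines = new ArrayList<>();
        StringBuilder line = new StringBuilder();
        for (String word : text.trim().split("\\s+")) {
            if (word.isEmpty()) continue;
            if (line.length() > 0 && line.length() + 1 + word.length() > width) {
                lines.add(line.toString());
                line.setLength(0);
            }
            if (line.length() > 0) line.append(' ');
            line.append(word);
        }
        if (line.length() > 0) lines.add(line.toString());
        return lines;
    }
}
```

Calling reflow(paragraph, 30) on text that was originally wrapped at 80 columns gives lines of at most 30 columns, except where a single word is itself longer than that.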

The rest of this section describes what can be expected of the particular conversions this software performs.

Text input documents

Plain text (traditionally in 7-bit ASCII representation, but more recently in the various Unicode encoding formats), is a ubiquitous way of storing documents. Nearly all computers can render (display) text documents. However, such documents are normally formatted on the assumption that a fixed-width font will be used, and that the width of the screen display or paper is known with respect to the font width (typically 70-80 characters across). It is not easy or comfortable to read a fixed-pitch font for a long period of time, and if the width of the paper or display device happens not to match the width of the document it will either be truncated, or not fill the space effectively. When printing on paper, the fact that there are large amounts of white space is not usually a problem (except for the trees), but if you want to read a document on a hand-held computer with a small screen this is a big deal -- both truncation and wasted screen area are likely on many documents.

This means that, to read a long plain text document painlessly, it is usually necessary to restructure it to fit the paper or display device. This is much less straightforward than it sounds, because there are few layout clues in the document.

ebookconverter reformats plain text documents with some understanding of the way that such documents are usually authored. For example, it assumes that a blank line between two lines of text is to be treated as a paragraph separator, and that a line of `-' or `=' signs under a line of text denotes that line as a heading. However, because there are no standards in this area, it is possible for the program to get the layout badly wrong. The algorithm the program uses allows for some tweaking, and it may be necessary to experiment with the settings to get reasonable results.
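The two cues mentioned above (blank lines as paragraph separators, and a row of `-' or `=' signs marking the line above as a heading) can be sketched in Java as follows. This is a simplified illustration, not the program's algorithm; the names are hypothetical, and the rule that an underline needs at least two characters is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

class TextStructure {
    // True if the line is made up only of '-' signs or only of '=' signs.
    // The minimum length of two is an assumption made for this sketch.
    static boolean isUnderline(String line) {
        return line.trim().matches("-{2,}|={2,}");
    }

    // Splits text into blocks, labelled "H: " (heading) or "P: " (paragraph).
    static List<String> blocks(String text) {
        List<String> out = new ArrayList<>();
        StringBuilder para = new StringBuilder();
        String[] lines = text.split("\r?\n", -1);
        for (int i = 0; i < lines.length; i++) {
            String line = lines[i].trim();
            boolean nextIsUnderline = i + 1 < lines.length && isUnderline(lines[i + 1]);
            if (line.isEmpty() || isUnderline(line)) {
                flush(para, out);            // blank line or underline ends a paragraph
            } else if (nextIsUnderline) {
                flush(para, out);
                out.add("H: " + line);       // underlined line is a heading
            } else {
                if (para.length() > 0) para.append(' ');
                para.append(line);           // ordinary line joins the paragraph
            }
        }
        flush(para, out);
        return out;
    }

    private static void flush(StringBuilder para, List<String> out) {
        if (para.length() > 0) {
            out.add("P: " + para);
            para.setLength(0);
        }
    }
}
```

Even this toy version shows why the guesswork can fail: a row of dashes used as a scene break, rather than as an underline, would turn the preceding line into a heading.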

HTML input documents

HTML does not suffer the same problems as plain text, because layout hints are given in the file. However, in practice, HTML suffers from two other problems: it is often badly-formed, and many authors use HTML tags not just to hint at the layout, but to specify it to the millimetre. ebookconverter makes some attempt to process badly-formed HTML, but results will vary. As for over-specification of layout, the program tries to get around that problem by ignoring layout, size, and positioning information that makes no sense for a small display (i.e., most of it).

Please bear in mind that this program is designed to convert long but, essentially, simple HTML documents -- that is, HTML the way it is typically used in e-books. It is not intended to be a fully-featured HTML-to-PDF converter. Most HTML tags and styles are ignored altogether. The following tags have at least some meaning to the program: B, BLOCKQUOTE, BR, CODE, DIV, DD/DT/DL, EM, FONT, I, IMG, H1, H2, H3, H4, HR, P, PRE, OL/LI, STRONG, TABLE/TBODY/TD/TR, TT, UL/LI.

Even these tags have limitations. Some of the most obvious ones are as follows:

  • Most explicit size and layout attributes, apart from those concerned with horizontal alignment, are ignored completely. I developed this program primarily for producing PDF for small displays, not paper documents. Much of the layout information used in HTML files, and the way measurements are made (usually in pixels) are unhelpful with such displays.
  • Table support is rudimentary. Tables are rendered with equal-sized columns, because the size information in HTML is unlikely to be useful in a PDF document formatted for a small display. Tables in which empty columns are used for layout fare particularly badly, because the empty columns take up as much room on the page as the columns with content.
  • Font tags are only processed for size. Font face and style are ignored. Presently the program does not support enough different fonts to make full support for the FONT tag workable.
  • Image support is rudimentary. Some JPEG, GIF, and PNG images can be processed but, most likely, they will only be found if they are specified relative to the document directory in the HTML document itself. The BASE tag is not processed. In addition, image size information is ignored, since it is usually in pixels. The image is scaled so that there are 72 pixels per page inch, and centred on the page. No text is wrapped around the image.
Most Web browsers display text with a ragged right margin (left justified). Because many people find large amounts of this sort of text difficult to read, this program defaults to full (left and right) justification for text. However, it does respect alignment attributes in paragraph tags, so if an HTML file asks for a ragged right margin, it will get it. To use ragged right for any HTML document (or any document, for that matter), specify -f default-alignment=left.

PDF input documents

The PDF format is designed for precise page rendering, unlike HTML which contains what are, at best, hints for the renderer. Provided that the author hasn't put precise layout instructions into the document, HTML is easy to convert to other formats because it is clear what bits of text are headings, lists, paragraphs, etc.

This is not so for PDF. Instead, the PDF document has to be converted to text by analysing individual characters and assembling them into words and lines. There is no information in the PDF file to indicate the function of a particular piece of text. In short, getting anything other than a stream of words out of a PDF document is exceptionally difficult. This program uses various rules-of-thumb for inferring elementary page layout from the stream of words that come out of the document, but results will vary between documents, according to how they were originally authored.

Some PDF documents cannot be converted at all. At present there is no support for encrypted documents, although such support could probably be added if there were a demand for it. But, more often, conversion fails because there is, in fact, no recognizable text in the document. It isn't all that uncommon to find PDF files that have been produced by scanning paper documents and inserting them into the PDF as large bitmaps. The PDF document may look like it contains text, but the deception becomes apparent when you try to scale the document -- it will look hideous. In any event, ebookconverter will not be able to extract the text from such a document, because there isn't any.

Conversion summary

From plain text
  • To plain text: Output is re-flowed according to format settings. Some of the layout and formatting of the source document will be preserved.
  • To HTML: Some of the layout and formatting of the source document will be preserved, and passed into the HTML.
  • To PDF: Some of the layout and formatting of the source document will be preserved, and passed into the PDF.

From HTML
  • To plain text: Text is re-flowed according to formatting settings, and emphasised and laid out so far as is practicable given the simplicity of the target format.
  • To HTML: The source HTML is cleaned up, by inserting any missing close tags. Otherwise, the document is unchanged.
  • To PDF: Most of the simpler HTML formatting is passed through into the PDF, with text being flowed to suit the document size. Most layout information in the HTML is ignored.

From PDF
  • To plain text: Text is extracted. The program lays out the output text by heuristics (guesswork).
  • To HTML: Text is extracted. The program lays out HTML by heuristics (guesswork). No formatting is extracted so the HTML is flat.
  • To PDF: Text is extracted. The program lays out the PDF by heuristics (guesswork) according to the output document size. No formatting is extracted so the PDF is flat.

From XSL-FO
  • To plain text: Text is emphasised and laid out as specified in the XSL-FO document, so far as is practicable given the simplicity of the text format. In practice, most FO instructions are ignored, because they cannot easily be rendered in plain text.
  • To HTML: Most of the simpler XSL-FO formatting and layout instructions are translated to HTML.
  • To PDF: This conversion is well supported, as most FO instructions have PDF equivalents.

Command-line switches

ebookconverter accepts a large number of command-line switches, to specify source processing and destination formatting. The defaults work adequately for many situations, but if you're producing PDF for a handheld device, you will at least need to specify the document size. Getting a document size to fit the screen, while giving readable text, will usually be a matter of trial and error.

Frequently-used formatting options can be placed in profile files for future use, as described later.

The basic format of the command line is:

java -jar ebookconverter.jar [-s source_options] [-f format_options] [-p profile_file] {input_file} {output_file}
For example:
java -jar ebookconverter.jar \
  -s encoding=utf-16,respect-linebreaks \
  -f page-width=150mm,page-height=90mm \
  my-text.txt my-pdf.pdf
(all on a single line, if you're using Windows)

The source_options switch controls how the source document is processed. format_options controls the formatting of the output document. profile_file is the name of a file that contains formatting information. This profile file can be used in addition to, or instead of, specifying formatting options on the command line. In practice, if you are converting a large number of files, use of a profile file is highly recommended, as entering the same information on the command line, over and over again, is tiresome.

The source-options and format-options switches

The general format of these switches is:
-s flag1,flag2,name1=value1,name2=value2...
-f flag1,flag2,name1=value1,name2=value2...
Note that there must be no spaces between the comma-separated options. Any number of options can be given -- some take values (e.g., -s encoding=utf-16), others are merely flags (e.g., -s respect-linebreaks). All -s options must be given in a single block. You can't, for example, say -s... -s....
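For readers who want to generate these option blocks from a script, the following Java sketch shows one way such a block could be split up. It is an assumption about how a parser might treat the syntax, not the program's actual parser; the class name and the choice to store bare flags as "true" are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class OptionBlock {
    // Parses "flag1,flag2,name1=value1" into a map.
    // Bare flags are stored with the value "true" (a choice made for this sketch).
    static Map<String, String> parse(String block) {
        Map<String, String> options = new LinkedHashMap<>();
        for (String item : block.split(",")) {
            if (item.isEmpty()) continue;
            int eq = item.indexOf('=');
            if (eq < 0) {
                options.put(item, "true");
            } else {
                options.put(item.substring(0, eq), item.substring(eq + 1));
            }
        }
        return options;
    }

    public static void main(String[] args) {
        System.out.println(parse("encoding=utf-16,respect-linebreaks"));
    }
}
```

Because the block is split on commas alone, a stray space after a comma would end the command-line argument early, which is why the whole block has to be typed without spaces.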

source-options

The source-options settings control how the source document is parsed. The defaults will work for many documents, but better results can often be achieved by tweaking.

respect-linebreaks

-s respect-linebreaks,...
If specified, ebookconverter treats all end-of-line markers in the file as indicating a line break in the output document. This option is only valid for text files -- HTML files never contain meaningful line breaks, except within PRE (preformat) tags, where they are also respected.

Without this option, the program ignores line breaks, and attempts to work out the document structure itself. This nearly always produces better results. However, in some documents the line breaks are highly significant, particularly poetry, plays, and program listings. With such documents you may have no alternative but to enable respect-linebreaks. The consequence of doing so is that you will end up with ragged lines and lots of white space at the ends of lines. This does not make efficient use of the screen on small displays.

type

-s type=text|html|pdf|guess,...
Specify the file type. The default is `guess', in which case the program uses the filename extension to guess the contents.

ignore-emphasis

-s ignore-emphasis,...
If this option is set, the program ignores emphasis in text documents, such as *bold* and _italic_. This may be necessary if the document uses these symbols for something else. This option also forces the program to ignore layout cues such as rows of `=' signs to underline a heading.
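As an illustration of what recognizing this kind of emphasis involves, here is a small Java sketch that rewrites *bold* and _italic_ as XHTML tags unless told to ignore them. The regular expressions are assumptions chosen for the example (for instance, underscores inside a word such as my_file_name are left alone); the program's own rules may well differ.

```java
class Emphasis {
    // Converts *bold* to <b>bold</b> and _italic_ to <i>italic</i>.
    // With ignoreEmphasis set, the text is returned untouched.
    static String markUp(String text, boolean ignoreEmphasis) {
        if (ignoreEmphasis) return text;
        return text
            // '*' must be followed and preceded by a non-space character
            .replaceAll("\\*(\\S(?:[^*]*\\S)?)\\*", "<b>$1</b>")
            // '_' must sit at a word boundary, so snake_case is not touched
            .replaceAll("\\b_(\\S(?:[^_]*\\S)?)_\\b", "<i>$1</i>");
    }

    public static void main(String[] args) {
        System.out.println(markUp("a *bold* and _italic_ word", false));
    }
}
```

A document that uses asterisks for footnote markers or multiplication would be a candidate for ignore-emphasis, since any pattern of this sort will occasionally match text that was never meant as emphasis.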

force-new-para-on-tab

-s force-new-para-on-tab,...
If set, any tab indent from the left margin of a text document indicates a paragraph break. If not set, documents which appear to use a tab this way are assumed to do so.

encoding

-s encoding=utf-8|Cp1252|...,...
Traditionally, text documents were produced in 7-bit ASCII format. Since there are usually 8 bits to a byte, many computer and software vendors came up with their own extensions that used 8 bits, gaining another 128 characters over the ASCII standard. More recently, multi-byte encoding has become common, particularly the Unicode standard.

Unless otherwise specified, ebookconverter assumes that input documents are in Unicode UTF-8 format. UTF-8 is a multi-byte format but, as the first 128 characters are the same as in ASCII, plain ASCII files will be read as well. You may also be able to process documents with other single-byte character sets in this mode, but the results will be variable. If possible, it is better to specify the correct character encoding, whether you are using a single-byte or multi-byte document.

In general, there is no way to determine the character encoding from inspecting a plain text document. Not reliably, anyway, although some software has a fair guess. ebookconverter requires the encoding to be specified explicitly, or for you to hope for the best. HTML documents sometimes specify the encoding, like this:

<META HTTP-EQUIV="Content-Type" CONTENT="text/html;charset=utf8">
If this tag is present, the encoding specified there takes priority over the command-line option.

This program does not process character encodings itself -- it just uses the Java language features. A full list of the character encodings supported by the Sun JVMs can be found on Sun's web site.

All Java implementations will support US-ASCII, UTF-8, and ISO-8859-1. The latter is a very common, single-byte extension to ASCII that has Western but non-English characters in its higher-numbered positions. If you read an ISO-8859-1 document as if it were UTF-8 (the default) you can expect some minor character transpositions. In many cases these will not be irksome.

A particular problem for document conversion is the Microsoft variant of ISO-8859-1, which these days is known as `Cp1252'. A great deal of Microsoft software, notably the `Office' products, produces output in this format if asked to write a plain text file. Cp1252 uses some extra punctuation characters, such as a reverse apostrophe. These characters exist in UTF-8, but did not exist in ISO-8859-1. If you see odd characters in the PDF output where you expect apostrophes or quotation marks, most likely you have a Microsoft text document. This form of encoding does turn up in HTML files as well as plain text documents. It shouldn't really, but it does. Try -s encoding=Cp1252.
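The effect of choosing the wrong encoding is easy to reproduce with the standard Java classes. The sketch below decodes the same four bytes three ways; the byte 0x92 is the Cp1252 code for a curly apostrophe. The class name is hypothetical, and the example assumes a JVM that provides the Cp1252 charset, as the common desktop JVMs do.

```java
import java.nio.charset.Charset;

class EncodingDemo {
    // Decodes raw bytes using the named character encoding.
    static String decode(byte[] raw, String encoding) {
        return new String(raw, Charset.forName(encoding));
    }

    public static void main(String[] args) {
        // "It's" as a Cp1252 application might write it, with a curly apostrophe.
        byte[] raw = { 'I', 't', (byte) 0x92, 's' };

        // Correct choice: the third character is U+2019, a right single quote.
        System.out.println(decode(raw, "Cp1252"));
        // Default assumption: 0x92 alone is not valid UTF-8, so Java
        // substitutes the replacement character U+FFFD.
        System.out.println(decode(raw, "UTF-8"));
        // ISO-8859-1 maps 0x92 to an invisible control character, U+0092.
        System.out.println(decode(raw, "ISO-8859-1"));
    }
}
```

Only the first decoding yields a printable apostrophe, which matches the symptom described above: odd characters in the output where quotation marks were expected.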

format-options

Format options control how the output document is formatted. These options can all be specified in a profile file as well as on the command line (see below).

Many of these options are measurements. Measurements can be specified in real units (mm, cm), points (pt), or `ems'. An em is the approximate width of the letter `m' in the current font, and is useful when you want a size that will scale with the font size.

In general, there are no default units, and you should always specify the units explicitly. The exception is textwidth (see below), which is measured in characters.
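To show how the different units relate, here is a rough Java sketch that converts a measurement string to points. It relies on the standard definitions of 72 points to the inch and 25.4mm to the inch. Treating one em as equal to the current font size is an assumption (it is the usual convention in XSL-FO and CSS), and the names are hypothetical.

```java
class Measure {
    // Converts "150mm", "2cm", "8pt" or "0.5em" to points.
    // One em is taken to be the font size in points (an assumption).
    static double toPoints(String measurement, double fontSizePt) {
        String m = measurement.trim();
        if (m.endsWith("mm")) return number(m) * 72.0 / 25.4;
        if (m.endsWith("cm")) return number(m) * 72.0 / 2.54;
        if (m.endsWith("pt")) return number(m);
        if (m.endsWith("em")) return number(m) * fontSizePt;
        // Mirrors the advice above: a bare number has no default unit.
        throw new IllegalArgumentException("No units given: " + measurement);
    }

    // Strips the two-letter unit suffix and parses the rest.
    private static double number(String m) {
        return Double.parseDouble(m.substring(0, m.length() - 2));
    }

    public static void main(String[] args) {
        System.out.println(toPoints("150mm", 10.0));
        System.out.println(toPoints("0.5em", 8.0));
    }
}
```

On this reckoning a 150mm page is a little over 425 points wide, which gives some feel for how many 8pt or 10pt characters will fit across it.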

allow-orphan-titles

-f allow-orphan-titles,... 
If this option is supplied, a title is allowed to be the last line on a page. Normally the program will shift titles onto the next page in this situation, because such an `orphaned title' looks ugly. However, when formatting for small displays, the amount of whitespace created by pushing titles forward may be unhelpful, and lead to too much page-turning.

This option only has much effect with HTML files, because the program may be unable to prevent orphaned titles with plain text files -- the distinction between titles and text in such files is often very unclear.

block-space

-f block-space=0.5em,...
The amount of space added below, and sometimes above, each block of text. A block is a paragraph, list, or table. Specifying the size in ems makes it relative to the current font, and therefore adjusts automatically for font size. For very small displays, you might need to set this as low as 0.1em. For printing, 0.5em-1.0em works best.

default-alignment

-f default-alignment=left|justify|right,...
If unspecified, text is justified (that is, fit to the left and right margins) by default when output as PDF. An HTML file can override the default, but note that there is no explicit way in HTML of requesting justified text. That is, you can't convert a text or PDF file to HTML and get justified output, because HTML does not support it.

Most people prefer to read text which is justified. In general, you won't get more (nor less) text on the screen that way -- it's just a matter of reading preference.

font-family

-f font-family=[name],... 
e.g.,
-f font-family=Times,... 
ebookconverter supports only the three standard letter font families that are required by the PDF Specification to work with all PDF viewers. These are `Helvetica', `Times', and `Courier'. In principle, `Symbol' and `ZapfDingbats' are also supported (as per the PDF Specification), but these contain few letter shapes, and so are unlikely to be useful.

Note that font family names are case-sensitive. The program will output a warning message if given the name of an unsupported font family. You can use the names serif and sans-serif as shortcuts for the default serif and sans-serif fonts respectively.

In principle, the FOP rendering engine (see below) which ebookconverter uses to render the final output does support other fonts than the basic set. However, the mechanisms for doing this are platform-dependent, and I did not want to introduce platform-dependencies into this document. Users who want a larger selection of fonts should refer to the Apache FOP documentation, available from the Apache Web site.

font-size

-f font-size=[size],... 
e.g.,
-f font-size=8pt,... 
Specifies the size for the base document font, in points. Point sizes from 6 to 12 usually work best. You'll get more on screen with a smaller font size, because the program will try to fill as much of the document as it can. But, of course, small fonts are harder to read. The Archos units handle fonts down to 6 point, but this is a strain on the eyes. 8-10 point sizes seem to work best. Headings are displayed in a larger font of the same typeface.

The default base font size is 10pt.

margin

-f margin=[size],... 
e.g.,
-f margin=20mm,... 
The margin setting is applied equally to the top, bottom, left, and right margins. These cannot be set independently (yet). For screen viewing, a margin of a few millimetres is usually enough. You could set it to zero, but this makes for surprisingly uncomfortable reading.

no-postproc

-f no-postproc,... 
Do not post-process text after it has come out of the transformation from intermediate XHTML (see the section `Technical details' below). This option is only useful if you are using a custom stylesheet (see the stylesheet option) to produce text output. Normally, disabling post-processing will result in very ugly text, with layout hint codes embedded in it, which won't be very useful.

page-width and page-height

-f page-height=[size],page-width=[size],... 
e.g.,
-f page-height=100mm,page-width=150mm,... 
These switches specify the size of the output document. This `document size' is a notional measure when formatting for reading on a screen, particularly a small screen, and you won't necessarily get a document that is a perfect fit to the screen by measuring it. But that should, at least, provide a start, which you can refine by trial-and-error.

If the document is a perfect fit to the screen, then you should not have to scroll to see different bits of each page. One screenful will be equal to one page. This is particularly important on hand-held devices, because scrolling is often not very fast.

For Archos devices, the sample profiles archos604, etc., have document sizes which fill the screen exactly, or as near exactly as I was able to determine.

The default page size is A4.

stylesheet

-f stylesheet=[filename],... 
Specify an XSL stylesheet to be used in the transformation of the XHTML which results from parsing the input file, into whatever output is required. Stylesheets are included for the basic transformations that the program supports, although some could (putting it mildly) be improved.

For PDF output, the stylesheet must take well-formed XHTML and produce XSL-FO. For other output formats, the stylesheet can produce whatever output is required (or possible).

textwidth

-f textwidth=[width],... 
e.g.,
-f textwidth=60,... 
Specifies the number of characters wide to make plain text files. This setting is only meaningful with plain text output -- with PDF output you need to set page-width, page-height and margin. With HTML, the text width is irrelevant, as text is reflowed by the browser to suit the screen size.

twocolumn

-f twocolumn,... 
Selects two-column output. This option only works with PDF output.

type

-f type=text|html|pdf|fo|guess,...
Specify the file type. The default is `guess', in which case the program uses the filename extension to guess the contents.

Profiles

To avoid entering a large number of formatting parameters on the command line, ebookconverter supports profile files. A profile file contains settings of the form
name=value
name=value
The options set in a profile file can be any valid option that would apply to the -f switch described above. To use a profile, specify the filename with -p, e.g.,
  java -jar ebookconverter.jar -p profiles/archos604 in.txt out.pdf
Options entered on the command line take priority over those in the profile if both locations are used. This means that you can use a profile file, and override only certain options on each operation.
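The override rule can be sketched with java.util.Properties, which reads the same name=value layout. This is an illustration of the precedence only, not the program's code; the class name and the profile contents in the example are hypothetical, since the shipped profiles are not reproduced here.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

class ProfileMerge {
    // Loads name=value lines from a profile, then applies the -f options
    // on top, so that command-line values win when both are present.
    static Properties effectiveOptions(String profileText, String formatOptions) {
        Properties options = new Properties();
        try {
            options.load(new StringReader(profileText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        for (String item : formatOptions.split(",")) {
            if (item.isEmpty()) continue;
            int eq = item.indexOf('=');
            if (eq < 0) {
                options.setProperty(item, "true");   // bare flag
            } else {
                options.setProperty(item.substring(0, eq), item.substring(eq + 1));
            }
        }
        return options;
    }

    public static void main(String[] args) {
        // Hypothetical profile contents, for illustration only.
        String profile = "page-width=150mm\npage-height=100mm\nfont-size=8pt\n";
        System.out.println(effectiveOptions(profile, "font-size=10pt,twocolumn"));
    }
}
```

In this example the page size still comes from the profile, while the font size given on the command line replaces the profile's value.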

Technical details

ebookconverter does not contain a huge amount of original code: it merely bolts together a number of existing open-source components. In particular, it uses the Apache FOP library to generate PDF, PDFBox for stripping text out of PDF files, Andy Clark's NekoHTML for fixing up broken HTML files before parsing, Xerces for parsing XML and well-formed HTML, Xalan for transforming XML from one format to another, and a number of other components.

ebookconverter uses XSL-FO as an intermediate format for PDF generation. XSL-FO is a (relatively) new standard for describing printable documents, similar in concept (but not in implementation) to TeX and LaTeX. Files destined for PDF output are first transformed into well-formed XHTML (which involves a certain amount of guesswork), and then via an XSL transformation into XSL-FO. The XSL-FO is then rendered into PDF by FOP.

Files destined for plain text or HTML output are first transformed into XHTML, as above, then via another XSL transformation into the output format. Plain text output is then post-processed to tidy up the output of the XSL transformation, which is not designed to produce neat, readable text.

This multi-stage process sounds complicated and, indeed, it is. But my intention is that by doing things this way, it will be easier to add support for other file types in future.
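The middle stage of that pipeline, XHTML to XSL-FO, can be tried out with nothing more than the XSLT engine built into the JDK. The sketch below uses a deliberately tiny stylesheet written for this example; it is not one of the stylesheets shipped with the program. It assumes the input paragraphs are un-namespaced p elements, and it stops at the XSL-FO text rather than calling FOP.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class XhtmlToFo {
    // Example stylesheet: every <p> becomes a justified fo:block
    // on a 150mm x 100mm page.
    static final String STYLESHEET =
        "<xsl:stylesheet version='1.0'"
      + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'"
      + " xmlns:fo='http://www.w3.org/1999/XSL/Format'>"
      + "<xsl:template match='/'>"
      + "<fo:root><fo:layout-master-set>"
      + "<fo:simple-page-master master-name='page'"
      + " page-width='150mm' page-height='100mm'>"
      + "<fo:region-body/></fo:simple-page-master>"
      + "</fo:layout-master-set>"
      + "<fo:page-sequence master-reference='page'>"
      + "<fo:flow flow-name='xsl-region-body'>"
      + "<xsl:apply-templates select='//p'/>"
      + "</fo:flow></fo:page-sequence></fo:root>"
      + "</xsl:template>"
      + "<xsl:template match='p'>"
      + "<fo:block text-align='justify'><xsl:value-of select='.'/></fo:block>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    // Applies the stylesheet to well-formed XHTML and returns the XSL-FO text.
    static String transform(String xhtml) {
        try {
            Transformer t = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
            StringWriter out = new StringWriter();
            t.transform(new StreamSource(new StringReader(xhtml)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(transform("<html><body><p>Hello</p><p>World</p></body></html>"));
    }
}
```

The resulting XSL-FO is what a renderer such as FOP would then turn into PDF. Swapping the stylesheet for one that emits text or HTML gives the other output paths described above.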

Processing arbitrary XML files

The use of proprietary XML markup for ebook-type documents is rare, but not unheard of. ebookconverter is able to convert XML to PDF, plain text, or another XML format, if the program is supplied with a suitable XSL stylesheet to effect the translation. An example of this usage may be found in the samples directory of the source code distribution. To convert to PDF, the transformation should produce XSL-FO; otherwise it can produce whatever output is desired.

Creating a suitable stylesheet for anything but a trivial XML format is a demanding undertaking, and one that requires considerable expertise in XSL and (for PDF output) XSL-FO. Such matters are beyond the scope of this document.

Bugs and limitations

Those I know about include the following:
  • The program does little to check whether the command-line arguments make sense. For example, if you ask it to produce a PDF document 1mm wide, it will try to. It might crash.
  • With long text files, this program uses a lot of memory and is very slow. If you get `Out of heap space' error messages or the like, try running the program with a larger Java heap. For example, to run with a 500Mb heap:
    java -Xmx500m -jar ebookconverter.jar in.txt out.pdf
    
    You probably shouldn't set -Xmx much higher than you find in practice that you need, because higher values increase the start-up time of the program.
  • In text mode, this program does not handle any HTML tags. The only markup it understands is _italic_ and *bold*, in addition to the common layout cues that are used in text documents. Any HTML tags will be rendered exactly as they appear in the document. If you want to use HTML, use HTML mode.
  • This program is reasonably tolerant of badly-formed HTML, although results are not always particularly impressive. However, there are certain formatting errors that break the program. Sorry.
  • There is little that ebookconverter can do when confronted with a really aberrant document -- jumbled up character encodings, bizarre mixtures of DOS and Unix end-of-line markers, documents with no formatting at all, documents containing very long lines of text interspersed with short lines, HTML with numeric character codes which only work on broken web browsers, and so on. All these things are reasonably common, unfortunately. I've done my best to make the program produce reasonable output with reasonable input; unreasonable input is more of a challenge.
  • As a particular example of the previous point, I've seen quite a few ebooks which use \r\r\n as an end of line marker. I have no idea why -- it isn't legal on any operating system I've come across. \r is a carriage return -- if you've returned the carriage, you can't return it again without moving it again! ebookconverter treats this combination as a line terminator followed by a blank line; or, rather, Java treats it this way. The spurious blank line messes up the end-of-paragraph algorithm, leading to unsatisfactory formatting. This problem is surprisingly difficult to fix. So I haven't.
  • The program may report a `File not found' error because it can't create the PDF output file (perhaps because you have specified a non-existent directory). That's just the way Java works.
  • At present, plain text output cannot be fully (left and right) justified. This is possible in principle, but I just haven't found the motivation really to grit my teeth and get down to writing the code. Because you can't represent a variable-width space in a plain text document, it doesn't often look that marvellous either, especially on small displays.
  • The PDF renderer (FOP) behaves very badly if presented with a very large (megabytes) block of completely unformatted text. This, to the renderer, is a single paragraph, and has to be formatted as such. The program will probably crash, but not before slowing your computer to a crawl for an hour or two. One does occasionally encounter plain-text ebooks that are so ill-formatted that this program cannot even find paragraph breaks in the text. It's rare, but it does happen. The problem is that changing the algorithm to find paragraphs in other places increases the likelihood of its putting paragraphs in the wrong place in well-authored documents. A sure-fire way to break this program is to process a long plain text document as if it were HTML (or to call a text document something.html, which amounts to the same thing).
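
To make the \r\r\n point in the list above concrete, the following sketch shows how Java's BufferedReader.readLine() splits such text, along with one possible pre-processing step that collapses the extra carriage returns. The class and method names are invented for illustration. The normalisation is only a guess at a workaround that could be applied to a file before conversion; it is not something ebookconverter does, and it would also flatten any document that genuinely mixes \r and \r\n line endings.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

public class LineEndingSketch {

    // Collapse any run of carriage returns that precedes a line feed
    // ("\r\n", "\r\r\n", ...) into a single "\n". A lone "\r" is untouched.
    static String normalise(String text) {
        return text.replaceAll("\r+\n", "\n");
    }

    // Split text into lines in the same way BufferedReader.readLine() does.
    static List<String> readLines(String text) {
        List<String> lines = new ArrayList<String>();
        try {
            BufferedReader in = new BufferedReader(new StringReader(text));
            String line;
            while ((line = in.readLine()) != null) {
                lines.add(line);
            }
        } catch (IOException e) {
            // Not expected when reading from an in-memory string.
            throw new IllegalStateException(e);
        }
        return lines;
    }

    public static void main(String[] args) {
        String raw = "first\r\r\nsecond\r\r\n";
        System.out.println(readLines(raw));            // [first, , second, ]
        System.out.println(readLines(normalise(raw))); // [first, second]
    }
}
```

The first line of output shows the spurious empty lines that upset paragraph detection; the second shows the same text after normalisation.
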
For technical reasons described above, not all documents are convertible to other formats. Many are convertible but don't produce good results. But no document should crash the program. If you come across a supported document type that crashes the program, please let me know, including full details of the document and any error messages. My contact details are here.

I will consider reasonable requests for new features. However, please don't ask me to add support for Microsoft Office formats, as Microsoft changes them too often to keep up. Reading RTF is a possibility, if there is a demand for it.

FAQ

Q. Why do I see hash (#) signs inside of quote marks?
A. Because the input document was created by a Microsoft application that uses a Microsoft encoding format. Try `-s encoding=Cp1252'.
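
The sketch below illustrates why the encoding matters, using only the standard java.nio.charset API. The class and method names are made up for this example. It decodes the same bytes twice: once as windows-1252 (the canonical name for what Java has historically called Cp1252) and once as ISO-8859-1. It says nothing about how ebookconverter itself reads files.

```java
import java.nio.charset.Charset;

public class EncodingSketch {

    // Decode raw bytes using the named character set.
    static String decode(byte[] data, String charsetName) {
        return new String(data, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        // 0x93 and 0x94 are the curly double quotes in windows-1252;
        // in ISO-8859-1 the same bytes are unprintable control characters.
        byte[] data = { (byte) 0x93, 'H', 'i', (byte) 0x94 };
        String right = decode(data, "windows-1252");
        String wrong = decode(data, "ISO-8859-1");
        System.out.printf("windows-1252: U+%04X%n", (int) right.charAt(0)); // U+201C
        System.out.printf("ISO-8859-1:   U+%04X%n", (int) wrong.charAt(0)); // U+0093
    }
}
```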

Q. Why can't I specify source options as well as format options in the profile file?
A. Because the optimal source settings will vary from document to document, while format settings should remain constant for a given display device or page size.

Q. Why are tables squashed into a small part of the page?
A. Probably because there are empty columns in the table. The program isn't (yet) smart enough to adjust column widths dynamically based on their contents, so even an empty column gets its share of the page.

Q. Why does the text document I formatted come out as a solid block of text a hundred pages long?
A. Because there aren't enough formatting cues in the document for this program to recognise where the paragraph divisions are. The program is prepared to consider blank lines, indents, and other things as hints to layout, but if none of these is present, it won't work very well.
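
For a feel of what such cues look like in code, here is a deliberately naive sketch that starts a new paragraph at a blank line or at an indented line. It is not the algorithm ebookconverter uses (which considers other hints too); the class and method names are invented for the example.

```java
import java.util.ArrayList;
import java.util.List;

public class ParagraphSketch {

    // Start a new paragraph at a blank line or at a line that begins with
    // whitespace (an indent). Lines within a paragraph are joined by spaces.
    static List<String> paragraphs(String text) {
        List<String> result = new ArrayList<String>();
        StringBuilder current = new StringBuilder();
        for (String line : text.split("\r?\n")) {
            boolean blank = line.trim().length() == 0;
            boolean indented = !blank && Character.isWhitespace(line.charAt(0));
            if ((blank || indented) && current.length() > 0) {
                result.add(current.toString());
                current.setLength(0);
            }
            if (!blank) {
                if (current.length() > 0) {
                    current.append(' ');
                }
                current.append(line.trim());
            }
        }
        if (current.length() > 0) {
            result.add(current.toString());
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "One two\nthree.\n\nFour five.\n  Six seven\neight.";
        for (String p : paragraphs(text)) {
            System.out.println("[" + p + "]");
        }
    }
}
```

With neither blank lines nor indents in the input, a routine like this returns the whole text as a single paragraph, which is exactly the hundred-page block described above.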

Q. Why do I get `out of memory' errors when my system has a squillion gigabytes of RAM?
A. Because the Java Virtual Machine will only use as much memory as you tell it, however much the system has. Use the -Xmx command-line switch to allocate more memory to the java program.
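
If you want to check what limit the JVM is actually working with, a few lines of standard Java will report it. This is a stand-alone sketch with an invented class name, not part of ebookconverter.

```java
public class HeapSketch {

    // Upper limit of the Java heap in megabytes, as set by -Xmx or,
    // if that switch is absent, by the JVM's own default.
    static long maxHeapMb() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        System.out.println("Maximum heap: " + maxHeapMb() + " MB");
    }
}
```

Running it as `java -Xmx500m HeapSketch` should report a figure close to 500, whatever the amount of RAM installed; the exact number can differ slightly between JVMs.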

Q. Why is there no graphical user interface?
A. Because I originally intended ebookconverter to be used for bulk conversion, controlled by a Linux shell script. A graphical user interface would be wholly unhelpful here. In any event, for most applications ebookconverter is trivial to use on the command line. But an equally important reason is that I hate coding user interfaces, and prefer to use the command line anyway.

Q. Why does conversion from PDF lose formatting?
A. The PDF format is designed for precise page rendering, unlike HTML which contains what are, at most, hints for the renderer. HTML is easy to convert to other formats because it is clear what bits of text are headings, lists, paragraphs, etc. This is not so for PDF. Instead, the PDF document has to be converted to text by analysing individual characters and assembling them into words and lines. There is no information in the PDF file to indicate the function of a particular piece of text. In short, getting anything other than a stream of words out of a PDF document is exceptionally difficult. This program uses various rules-of-thumb for inferring elementary page layout from the stream of words that come out of the document, but results will vary between documents, according to how they were originally authored. See the section `PDF input documents' above for more details.

Q. Why do I keep getting the message "Line 1 of a paragraph overflows the available area"?
A. This indicates that the PDF renderer is trying to fit a sequence of characters that cannot be split up into a width that cannot accommodate it. This does not usually happen with ordinary text, but is common with long e-mail addresses, and the long strings of `"' and `-' characters that are commonly used to indicate headings in plain text documents. If you can't modify the source document, you could try formatting with a smaller font. But, in the end, if it doesn't fit, it doesn't fit.

Legal stuff

The original part of this program is copyright ©1997 Kevin Boone. This part is distributed according to the GNU Public Licence. In effect, what this means is that you can do whatever you like with it, provided that you acknowledge the original author, and make the source code available. The embedded software components fop, PDFBox, FontBox, BouncyCastle, and NekoHTML, and some of the open-source components that they themselves depend on, have their own licence conditions, which can be found in the licences directory of the distribution.

   