©1994-2007 Kevin Boone
ebook converter
Last modified: Mon Feb 19 12:06:37 2007
Version 0.1a, Copyright ©2007 Kevin Boone
Distributed under the terms of the GNU Public Licence

What is this?

ebookconverter is a simple utility for converting ebook
documents between HTML, text, and PDF formats.
By `ebook documents' I mean long documents whose main component is text,
although simple formatting is supported. Another characteristic of
ebooks, as documents, is that a presentation that is easy to read, and makes
effective use of the screen or display area, is more important than
precise layout.
In particular, this software is intended to facilitate the reading of ebooks on
hand-held computers and media players, which usually support only a very limited
range of document formats, and have small screens (small, that is, compared to
printed paper). For example, the Archos x04 range (404, 504, 604) of portable
media players has small (2"-3") screens, and can read only PDF
format. While it is, of course, possible to convert documents to PDF
using modern word processors, word processors are not usually aware
of the constraints of small screens. In particular, they do not usually
re-flow text to use all the available screen area. This program does.
What's more, because this program is entirely driven by command-line
operation, it can easily be embedded in scripts or batch files, and used
to do bulk conversion -- something that is not practicable with a
word processor.
ebookconverter is written
entirely in Java, and should run on most modern computer platforms. It has
mostly been tested under Linux. It is intended to run under Sun's JDK 1.5 Java
run-time engine, or something very similar. ebookconverter
makes use of a substantial number of
open-source components from the Apache project and elsewhere
-- see below for legal and
copying information.
HTML support in this program is fairly rudimentary. Only basic formatting
tags are recognized. As with plain text, the program works on the assumption
that the document is to be read on a small screen, and adjusts the layout
accordingly. Please remember that the main goal of this program is
to get something readable on small screens. If you're looking for
maximum fidelity
between the source and destination document formats, this program is
not for you.
Getting the software

The source and binary package can be downloaded from here. If you unzip this archive you will find, among other things, a file ebookconverter.jar,
which is the only file you need to run the application.
Prerequisites

To use ebookconverter you need JDK 1.5.0 or later, or something
compatible, and the Java run-time engine needs to be runnable from the command
line. That is, if you do:

    % java

on a Unix system or

    c:\> java

on Windows you should get a Java usage message, and not an error message of some kind. If you need to enter the full path to java that's OK,
but it will be less convenient.
I have done most testing with JDK 1.5.0_04. I would expect
anything later than this to work, but I can't promise. I haven't
tested with any Java implementation but Sun's, and I don't propose to.
It may work; it may not. This program will very likely not work with JDK 1.4.x
JVMs, even if you recompile it to suit, because of the way the embedded XML
parser has changed between 1.4.x and 1.5.0.
PDF generation is a memory-intensive process. Memory requirements vary
according to the size of the document but, for very long documents
(more than a few hundred pages), you
may find that up to a gigabyte of memory is required. Of course,
this need not be real, physical RAM; but if most of it isn't, you can expect
a certain amount of hard disk thrashing.
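Note that a Java Virtual Machine only uses as much heap as it has been told it may use, regardless of how much memory the system has; the limit is raised with the standard -Xmx switch (for instance, java -Xmx1g -jar /usr/lib/ebookconverter.jar in.txt out.pdf, where the 1g figure is only an illustrative guess). The following minimal sketch, which is not part of ebookconverter, simply reports the limit the running JVM was given, which can help when checking that an -Xmx setting took effect. The class and method names are hypothetical.

```java
/** Illustrative only: reports the heap ceiling the running JVM was given. */
final class HeapLimit {
    /** Maximum heap the JVM will attempt to use, in megabytes. */
    static long maxHeapMegabytes() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        // Run with, e.g., "java -Xmx1g HeapLimit.java" to see the effect of -Xmx.
        System.out.println("Maximum heap: " + maxHeapMegabytes() + " MB");
    }
}
```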
Installation

The entire application is contained in a single file: ebookconverter.jar. Please copy this file to any convenient place. I would suggest
/usr/lib on Unix systems, and \windows\system32\
on Windows. The choice of directory is entirely arbitrary, but if the
name is quick to type, that will probably be more convenient. I will assume
in what follows that you have installed in /usr/lib.
The archive is supplied with a number of predefined formatting profiles
to suit the Archos x04 personal media players. These are in the
properties directory. If you wish to use these, they can be
copied to any convenient place.
Basic usage

ebookconverter has a great many command-line options for
fine-tuning its operation. These will all be described in detail below.
However, the simplest mode of operation is:

    java -jar /usr/lib/ebookconverter.jar [input_file] [output_file]

You will need to replace /usr/lib with whatever directory you
installed in. input_file is the name of an existing text,
HTML, PDF, or XSL-FO file.
output_file is the name of a text, HTML, or PDF file which may,
or may not, exist.
If it exists, it will be over-written without warning.
In this mode of operation, the program infers the document types from
the filename extensions.
Unless told otherwise, it assumes that HTML files have names that
end in .htm or .html, and text files have names that
end in .txt or .text. You can override this
assumption if necessary (see below for details).
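As a rough illustration of extension-based type guessing, here is a minimal sketch. It is not the program's actual code: the class and method names are hypothetical, the .pdf and .fo extensions are assumptions (only the HTML and text extensions are stated above), and whether the real program ignores case is not known.

```java
import java.util.Locale;

/** Hypothetical helper: guess a document type from its filename extension. */
final class TypeGuess {
    static String guessType(String filename) {
        // Case-insensitive matching is an assumption made for this sketch.
        String name = filename.toLowerCase(Locale.ROOT);
        if (name.endsWith(".htm") || name.endsWith(".html")) return "html";
        if (name.endsWith(".txt") || name.endsWith(".text")) return "text";
        if (name.endsWith(".pdf")) return "pdf";   // assumed extension
        if (name.endsWith(".fo")) return "fo";     // assumed extension
        return "unknown";                          // caller would need -s type=...
    }

    public static void main(String[] args) {
        System.out.println(guessType("novel.html"));
        System.out.println(guessType("notes.text"));
    }
}
```

When the guess would be wrong or impossible, the type option described later is the way to state the type explicitly.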
Unless told otherwise, the program will produce output suitable for
printing on A4 paper. For any other document or screen size, you will need
to modify the operation with command-line switches or profile files.
Conversions supported

ebookconverter converts between text, HTML, and PDF files.
It can also convert from, but not yet to, XSL-FO files. XSL-FO is a relatively
new standard for presenting formatted text.
Not all these conversions produce staggeringly impressive results, for
reasons that will be described below. ebookconverter
can also convert files into
different instances of the same format. Why? Consider the Archos
604 media player, for example. It can read only PDF documents, and has
a screen about 150mm wide by 100mm high. A PDF document formatted for,
say, A4 paper will be readable on this device, but not very well.
This is because if you scale the document so it fits on the screen,
the text will be too small to read. And if you don't, you'll have to
pan around the document to get from the beginning of a text line to the
end. You'll be able to read it, but not comfortably, and certainly not
quickly.
However, if you convert the original PDF to a new PDF using
ebookconverter, and set the output document size to
150mm x 100mm, you'll get a document whose text fills the whole screen
when viewed on the Archos device, and which fits the screen exactly
-- no zooming or panning required. Such a document will be much
more comfortable to read.
Of course, you'll lose all the layout and most of the formatting information
in the original file. There are both technical and pragmatic reasons
for this. The technical reasons are discussed later in this document;
the main pragmatic reason is that it is essentially the layout information
in the original file that stopped it fitting the screen in the first place.
Another example: some ebook viewers for handheld devices expect plain text
files to be formatted to the correct line length to fit the screen.
This program can take a text
file which is formatted to, say, 80 columns per line, and convert into a
document with 30 columns per line, to suit a smaller display.
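The idea of re-wrapping text to a narrower line length can be sketched as a simple greedy word wrap. This is an illustration under stated assumptions, not the algorithm ebookconverter uses: the class and method names are hypothetical, the input is assumed to be a single paragraph, and a word longer than the target width is simply left on a line of its own.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: greedy re-wrap of one paragraph to a new line width. */
final class Reflow {
    static List<String> reflow(String paragraph, int width) {
        List<String> lines = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        // Original line breaks are treated as ordinary whitespace.
        for (String word : paragraph.trim().split("\\s+")) {
            if (word.isEmpty()) continue;
            // Start a new line if adding this word would exceed the width.
            if (current.length() > 0 && current.length() + 1 + word.length() > width) {
                lines.add(current.toString());
                current.setLength(0);
            }
            if (current.length() > 0) current.append(' ');
            current.append(word);
        }
        if (current.length() > 0) lines.add(current.toString());
        return lines;
    }

    public static void main(String[] args) {
        String text = "This paragraph was typed for an 80-column display\n"
                    + "but needs to be read on something much narrower.";
        for (String line : reflow(text, 30)) System.out.println(line);
    }
}
```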
The rest of this section describes what can be expected of the particular conversions this software performs.

Text input documents

Plain text (traditionally in 7-bit ASCII representation, but more recently in the various Unicode encoding formats) is a ubiquitous way of storing documents. Nearly all computers can render (display) text documents. However, such documents are normally formatted on the assumption that a fixed-width font will be used, and that the width of the screen display or paper is known with respect to the font width (typically 70-80 characters across). It is not easy or comfortable to read a fixed-pitch font for a long period of time, and if the width of the paper or display device happens not to match the width of the document it will either be truncated, or not fill the space effectively. When printing on paper, the fact that there are large amounts of white space is not usually a problem (except for the trees), but if you want to read a document on a hand-held computer with a small screen this is a big deal -- both truncation and wasted screen area are likely on many documents.

This means that, to read a long plain text document painlessly, it is usually necessary to restructure it to fit the paper or display device. This is much less straightforward than it sounds, because there are few layout clues in the document.

ebookconverter reformats plain text documents with some understanding
of the way that such documents are usually authored. For
example, it assumes that a blank line between two lines of text is to
be treated as a paragraph separator, and that a line of `-' or `=' signs
under a line of text denotes that line as a heading. However, because
there are no standards in this area, it is possible for the program to
get the layout badly wrong. The algorithm the program uses allows for some
tweaking, and it may be necessary to experiment with the settings
to get reasonable results.
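The two conventions mentioned above (a blank line separates paragraphs; a row of `-' or `=' under a line marks it as a heading) can be sketched as follows. This is only an illustration of the idea, not the program's real algorithm, which has more rules and tunable settings. The class name, the "H:"/"P:" tags, and the minimum underline length of two characters are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch of structure detection in plain text. */
final class TextStructure {
    /** True if the line is made up only of '-' or only of '=' (at least two). */
    static boolean isUnderline(String line) {
        String t = line.trim();
        return t.matches("-{2,}") || t.matches("={2,}");
    }

    /** Groups lines into blocks tagged "H:" (heading) or "P:" (paragraph). */
    static List<String> blocks(List<String> lines) {
        List<String> out = new ArrayList<>();
        StringBuilder para = new StringBuilder();
        for (int i = 0; i < lines.size(); i++) {
            String line = lines.get(i).trim();
            if (line.isEmpty()) {
                flush(para, out);                  // blank line ends a paragraph
            } else if (!isUnderline(line) && i + 1 < lines.size()
                    && isUnderline(lines.get(i + 1))) {
                flush(para, out);
                out.add("H:" + line);              // underlined line is a heading
                i++;                               // skip the underline itself
            } else {
                if (para.length() > 0) para.append(' ');
                para.append(line);                 // join wrapped lines
            }
        }
        flush(para, out);
        return out;
    }

    private static void flush(StringBuilder para, List<String> out) {
        if (para.length() > 0) {
            out.add("P:" + para);
            para.setLength(0);
        }
    }

    public static void main(String[] args) {
        List<String> doc = Arrays.asList("Chapter One", "===========", "",
                "It was a dark", "and stormy night.", "", "The end.");
        blocks(doc).forEach(System.out::println);
    }
}
```

A document with no blank lines or underlines would come out of this sketch as a single paragraph, which is the same failure mode the real program can exhibit when layout cues are missing.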
HTML input documents

HTML does not suffer the same problems as plain text, because layout hints are given in the file. However, in practice, HTML suffers from two other problems: it is often badly-formed, and many authors use HTML tags not just to hint at the layout, but to specify it to the millimetre.

ebookconverter
makes some attempt to process badly-formed HTML, but results will vary. As for
over-specification of layout, the program tries to get around that problem by
ignoring layout, size, and positioning information that makes no sense for a
small display (i.e., most of it).
Please bear in mind that this program is designed to convert long but,
essentially,
simple HTML documents -- that is,
HTML the way it is typically used in e-books. It is
not intended to be a fully-featured HTML-to-PDF converter. Most HTML tags and
styles are ignored altogether. The following tags have at least some meaning to
the program: B, BLOCKQUOTE, BR, CODE, DIV, DD/DT/DL, EM, FONT, I, IMG,
H1, H2, H3,
H4, HR, P, PRE, OL/LI, STRONG, TABLE/TBODY/TD/TR, TT, UL/LI.
Even these tags have limitations. Some of the most obvious ones are as follows:
-f default-alignment=left.
PDF input documents

The PDF format is designed for precise page rendering, unlike HTML which contains what are, at best, hints for the renderer. Provided that the author hasn't put precise layout instructions into the document, HTML is easy to convert to other formats because it is clear what bits of text are headings, lists, paragraphs, etc. This is not so for PDF. Instead, the PDF document has to be converted to text by analysing individual characters and assembling them into words and lines. There is no information in the PDF file to indicate the function of a particular piece of text. In short, getting anything other than a stream of words out of a PDF document is exceptionally difficult. This program uses various rules-of-thumb for inferring elementary page layout from the stream of words that come out of the document, but results will vary between documents, according to how they were originally authored.

Some PDF documents cannot be converted at all. At present there is no support for encrypted documents, although such support could probably be added if there was a demand for it. But, more often, conversion fails because there is, in fact, no recognizable text in the document. It isn't all that uncommon to find PDF files that have been produced by scanning paper documents and inserting them into the PDF as large bitmaps. The PDF document may look like it contains text, but the deception becomes apparent when you try to scale the document -- it will look hideous. In any event, ebookconverter will not be able to extract
the text from such a document, because there isn't any.
Conversion summary
Command-line switches

ebookconverter accepts a large number of command-line switches,
to specify source processing and destination formatting. The defaults
work adequately for many situations, but if you're producing PDF for a
handheld device, you will at least need to specify the document size.
Getting a document size to fit the screen, while giving readable text,
will usually be a matter of trial
and error.
Frequently-used formatting options can be placed in profile files for future
use, as described later.
The basic format of the command line is:

    java -jar ebookconverter.jar [-s source_options] [-f format_options] [-p profile_file] {input_file} {output_file}

For example:

    java -jar ebookconverter.jar \
      -s encoding=utf-16,respect-linebreaks \
      -f page-width=150mm,page-height=90mm \
      my-text.txt my-pdf.pdf

(all on a single line, if you're using Windows)

The source-options and format-options switches

The general format of these switches is:

    -s flag1,flag2,name1=value1,name2=value2...
    -f flag1,flag2,name1=value1,name2=value2...

Note that there must be no spaces between the comma-separated options. Any number of options can be given -- some take values (e.g., -s encoding=utf-16),
others are merely flags (e.g., -s respect-linebreaks).
All -s options must be given in a single block. You can't, for
example, say -s... -s....
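To make the option syntax concrete, the following sketch splits one option block into flags and name=value pairs. It is an assumption about how such a block could be parsed, not the program's own parser; the class name is hypothetical, and representing a bare flag as the value "true" is a choice made only for this example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch: parse "flag1,name1=value1,..." into a map. */
final class OptionBlock {
    static Map<String, String> parse(String block) {
        Map<String, String> options = new LinkedHashMap<>();
        // No spaces are allowed between options, so a plain comma split is enough.
        for (String item : block.split(",")) {
            if (item.isEmpty()) continue;
            int eq = item.indexOf('=');
            if (eq < 0) {
                options.put(item, "true");                       // bare flag
            } else {
                options.put(item.substring(0, eq), item.substring(eq + 1));
            }
        }
        return options;
    }

    public static void main(String[] args) {
        System.out.println(parse("encoding=utf-16,respect-linebreaks"));
    }
}
```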
source-options

The source-options settings control how the source document is parsed. The defaults will work for many documents, but better results can often be achieved by tweaking.

respect-linebreaks

    -s respect-linebreaks,...

If specified, ebookconverter treats all end-of-line markers in the
file as indicating a line break in the output document. This option
is only valid for text files -- HTML files never contain meaningful
line breaks, except within PRE (preformat) tags, where they are also
respected.
Without this option, the program ignores line breaks, and attempts to
work out the document structure itself. This nearly always produces better
results. However, in some documents the line breaks are highly significant,
particularly poetry, plays, and program listings. With such documents you
may have no alternative but to enable respect-linebreaks.
The consequence of doing so is that you will end up with ragged lines
and lots of white space at the ends of lines. This does not make efficient
use of the screen on small displays.
type

    -s type=text|html|pdf|guess,...

Specify the file type. The default is `guess', in which case the program uses the filename extension to guess the contents.

ignore-emphasis

    -s ignore-emphasis,...

If this option is set, the program ignores emphasis in text documents, such as *bold* and _italic_. This may be necessary if the document uses these symbols for something else. This option also forces the program to ignore layout cues such as rows of `=' signs to underline a heading.

force-new-para-on-tab

    -s force-new-para-on-tab,...

If set, any tab indent from the left margin of a text document indicates a paragraph break. If not set, documents which appear to use a tab this way are assumed to do so.

encoding

    -s encoding=utf-8|Cp1252|...,...

Traditionally, text documents were produced in 7-bit ASCII format. Since there are usually 8 bits to a byte, many computer and software vendors came up with their own extensions that used 8 bits, gaining another 128 characters over the ASCII standard. More recently, multi-byte encoding has become common, particularly the Unicode standard.

Unless otherwise specified, ebookconverter assumes that input
documents are in Unicode UTF-8 format. UTF-8 is a multi-byte format but,
as the first 128 characters are the same as in ASCII, plain ASCII files
will be read as well. You may also be able to process documents with
other single-byte character sets in this mode, but the results will be variable.
If possible, it is better to specify the correct character encoding,
whether you are using a single-byte or multi-byte document.
In general, there is no way to determine the character encoding from
inspecting a plain text document. Not reliably, anyway, although some software
has a fair guess. ebookconverter requires the encoding to be
specified explicitly, or for you to hope for the best.
HTML documents sometimes specify the encoding, like this:

    <META HTTP-EQUIV="Content-Type" CONTENT="text/html;charset=utf8">

If this tag is present, the encoding specified there takes priority over the command-line option.

This program does not process character encodings itself -- it just uses the Java language features. A full list of the character encodings supported by the Sun JVMs can be found on Sun's web site. All Java implementations will support US-ASCII, UTF-8, and ISO-8859-1. The latter is a very common, single-byte extension to ASCII that has Western but non-English characters in its higher-numbered positions. If you read an ISO-8859-1 document as if it were UTF-8 (the default) you can expect some minor character transpositions. In many cases these will not be irksome.

A particular problem for document conversion is the Microsoft variant of ISO-8859-1, which these days is known as `Cp1252'. A great deal of Microsoft software, notably the `Office' products, produces output in this format if asked to write a plain text file. Cp1252 uses some extra punctuation characters, such as a reverse apostrophe. These characters exist in UTF-8, but did not exist in ISO-8859-1. If you see odd characters in the PDF output where you expect apostrophes or quotation marks, most likely you have a Microsoft text document. This form of encoding does turn up in HTML files as well as plain text documents. It shouldn't really, but it does. Try -s encoding=Cp1252.
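The effect of decoding Cp1252 text as UTF-8 can be seen with a few lines of standard Java. This sketch is independent of ebookconverter; the class and method names are hypothetical. It uses the byte 0x92, which is a curly apostrophe in Cp1252 (registered in Java under the canonical name windows-1252) but is not valid on its own in UTF-8, so the UTF-8 decoder substitutes the Unicode replacement character.

```java
import java.nio.charset.Charset;

/** Illustrative only: the same bytes decoded with two different charsets. */
final class EncodingDemo {
    static String decode(byte[] bytes, String charsetName) {
        return new String(bytes, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        // "it", then byte 0x92, then "s", as a Microsoft application might write "it's".
        byte[] bytes = { 'i', 't', (byte) 0x92, 's' };

        String asCp1252 = decode(bytes, "windows-1252"); // apostrophe is U+2019
        String asUtf8 = decode(bytes, "UTF-8");          // apostrophe becomes U+FFFD

        System.out.println((int) asCp1252.charAt(2) == 0x2019);
        System.out.println((int) asUtf8.charAt(2) == 0xFFFD);
    }
}
```

How a replaced character finally appears in the output depends on the renderer and font, so the exact symbol seen may differ.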
format-options

Format options control how the output document is formatted. These options can all be specified in a profile file as well as on the command line (see below).

Many of these options are measurements. Measurements can be specified in real units (mm, cm), points (pt), or `ems'. An em is the approximate width of the letter `m' in the current font, and is useful when you want a size that will scale with the font size. In general, there are no default units, and you should always specify the units explicitly. The exception is textwidth (see below),
which is measured in characters.
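To show how the units relate, here is a small sketch that converts a measurement string to points, using the standard definitions of 72 points and 25.4 mm to the inch. It is not taken from ebookconverter: the class and method names are hypothetical, and one em is treated as equal to the current font size, which is the usual CSS and XSL-FO convention.

```java
import java.util.Locale;

/** Hypothetical sketch: convert "150mm", "2cm", "8pt" or "0.5em" to points. */
final class Measure {
    static final double POINTS_PER_INCH = 72.0;
    static final double MM_PER_INCH = 25.4;

    static double toPoints(String measurement, double fontSizePt) {
        String m = measurement.trim().toLowerCase(Locale.ROOT);
        if (m.length() < 3) {
            // There are no default units, so a bare number is rejected.
            throw new IllegalArgumentException("Expected a number and units: " + measurement);
        }
        double value = Double.parseDouble(m.substring(0, m.length() - 2));
        String units = m.substring(m.length() - 2);
        switch (units) {
            case "mm": return value * POINTS_PER_INCH / MM_PER_INCH;
            case "cm": return value * 10.0 * POINTS_PER_INCH / MM_PER_INCH;
            case "pt": return value;
            case "em": return value * fontSizePt;   // assumption: 1em = font size
            default:
                throw new IllegalArgumentException("Unknown units: " + measurement);
        }
    }

    public static void main(String[] args) {
        System.out.println(toPoints("150mm", 10.0));
        System.out.println(toPoints("0.5em", 8.0));
    }
}
```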
allow-orphan-titles

    -f allow-orphan-titles,...

If this option is supplied, a title is allowed to be the last line on a page. Normally the program will shift titles onto the next page in this situation, because such an `orphaned title' looks ugly. However, when formatting for small displays, the amount of whitespace created by pushing titles forward may be unhelpful, and lead to too much page-turning. This option only has much effect with HTML files, because the program may be unable to prevent orphaned titles with plain text files -- the distinction between titles and text in such files is often very unclear.

block-space

    -f block-space=0.5em,...

The amount of space added below, and sometimes above, each block of text. A block is a paragraph, table, or list. Specifying the size in ems makes it relative to the current font, and therefore adjusts automatically for font size. For very small displays, you might need to set this as low as 0.1em. For printing, 0.5em-1.0em works best.

default-alignment

    -f default-alignment=left|justify|right,...

If unspecified, text is justified (that is, fit to the left and right margins) by default when output as PDF. An HTML file can override the default, but note that there is no explicit way in HTML of requesting justified text. That is, you can't convert a text or PDF file to HTML and get justified output, because HTML does not support it. Most people prefer to read text which is justified. In general, you won't get more (nor less) text on the screen that way -- it's just a matter of reading preference.

font-family

    -f font-family=[name],...

e.g., -f font-family=Times,...

ebookconverter supports only the three standard letter font families
that are required by the PDF Specification to work with all PDF
viewers. These are `Helvetica', `Times', and `Courier'. In principle, `Symbol' and
`ZapfDingbats' are also supported (as per the PDF Specification), but
these contain few letter shapes, and so are unlikely to be useful.
Note that font family names are case-sensitive. The program will
output a warning message if given the name of an unsupported font
family. You can use the names serif and sans-serif
as shortcuts for the default serif and sans-serif fonts respectively.
In principle, the FOP rendering engine (see below) which ebookconverter
uses to render the final output does support other fonts than the basic
set. However, the mechanisms for doing this are platform-dependent, and I
did not want to introduce platform-dependencies into this document.
Users who want a larger selection of fonts should refer to the
Apache FOP documentation, available from the
Apache Web site.
font-size

    -f font-size=[size],...

e.g., -f font-size=8pt,...

Specifies the size for the base document font, in points. Point sizes from 6 to 12 usually work best. You'll get more on screen with a smaller font size, because the program will try to fill as much of the document as it can. But, of course, small fonts are harder to read. The Archos units handle fonts down to 6 point, but this is a strain on the eyes. 8-10 point sizes seem to work best. Headings are displayed in a larger font of the same typeface. The default base font size is 10pt.

margin

    -f margin=[size],...

e.g., -f margin=20mm,...

The margin setting is applied equally to the top, bottom, left, and right margins. These cannot be set independently (yet). For screen viewing, a margin of a few millimetres is usually enough. You could set it to zero, but this makes for surprisingly uncomfortable reading.

no-postproc

    -f no-postproc,...

Do not post-process text after it has come out of the transformation from intermediate XHTML (see the section `Technical details' below). This option is only useful if you are using a custom stylesheet (see the stylesheet option) to produce text output.
Normally, disabling post-processing will result in very ugly text, with
layout hint codes embedded in it, which won't be very useful.
page-width and page-height

    -f page-height=[size],page-width=[size],...

e.g., -f page-height=100mm,page-width=150mm,...

These switches specify the size of the output document. This `document size' is a notional measure when formatting for reading on a screen, particularly a small screen, and you won't necessarily get a document that is a perfect fit to the screen by measuring it. But that should, at least, provide a start, which you can refine by trial-and-error. If the document is a perfect fit to the screen, then you should not have to scroll to see different bits of each page. One screenful will be equal to one page. This is particularly important on hand-held devices, because scrolling is often not very fast.

For Archos devices, the sample profiles archos604, etc.,
have document sizes which fill the screen exactly, or as near exactly as
I was able to determine.
The default page size is A4.
stylesheet

    -f stylesheet=[filename],...

Specify an XSL stylesheet to be used in the transformation of the XHTML which results from parsing the input file, into whatever output is required. Stylesheets are included for the basic transformations that the program supports, although some could (putting it mildly) be improved. For PDF output, the stylesheet must take well-formed XHTML and produce XSL-FO. For other output formats, the stylesheet can produce whatever output is required (or possible).

textwidth

    -f textwidth=[width],...

e.g., -f textwidth=60,...

Specifies the width, in characters, of plain text output. This setting is only meaningful with plain text output -- with PDF output you need to set page-width, page-height and
margin. With HTML, the text width is irrelevant, as text is
reflowed by the browser to suit the screen size.
twocolumn

    -f twocolumn,...

Selects two-column output. This option only works with PDF output.

type

    -s type=text|html|pdf|fo|guess,...

Specify the file type. The default is `guess', in which case the program uses the filename extension to guess the contents.

Profiles

To avoid entering a large number of formatting parameters on the command line, ebookconverter supports profile files. A profile file contains settings of
the form

    name=value
    name=value

The options set in a profile file can be any valid option that would apply to the -f switch described above. To use a profile,
specify the filename with -p, e.g.,

    java -jar ebookconverter.jar -p profiles/archos604 in.txt out.pdf

Options entered on the command line take priority over those in the profile if both locations are used. This means that you can use a profile file, and override only certain options on each operation.

Technical details

ebookconverter does not contain a huge amount of original code:
it merely bolts together a number of existing open-source components.
In particular, it uses the Apache FOP library to generate PDF, PDFBox for
stripping text out of PDF Files, Andy Clark's NekoHTML for fixing up broken
HTML files before parsing, Xerces for parsing XML and well-formed HTML,
Xalan for transforming XML from one format to another, and a number of
other components.
ebookconverter uses XSL-FO as an intermediate format for
PDF generation. XSL-FO is a (relatively) new
standard for describing printable documents, similar in concept
(but not in implementation) to TeX and LaTeX. Files destined for PDF output are first transformed into
well-formed XHTML (which involves a certain amount of guesswork), and
then via an XSL transformation into XSL-FO. The XSL-FO is then rendered
into PDF by FOP.
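The XHTML-to-XSL-FO step can be illustrated with the XSLT support built into the JDK (javax.xml.transform). This is a toy sketch under stated assumptions, not the stylesheet or code shipped with ebookconverter: the class name and stylesheet are hypothetical, the input omits the XHTML namespace for brevity, and the output lacks the page-master and page-sequence elements that a real renderer such as FOP would need.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

/** Hypothetical sketch: turn each XHTML paragraph into an fo:block. */
final class XhtmlToFo {
    // Toy stylesheet, written for this example only.
    static final String XSL =
          "<xsl:stylesheet version='1.0'"
        + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'"
        + " xmlns:fo='http://www.w3.org/1999/XSL/Format'>"
        + "<xsl:template match='/html'>"
        + "<fo:root><xsl:apply-templates select='body/p'/></fo:root>"
        + "</xsl:template>"
        + "<xsl:template match='p'>"
        + "<fo:block><xsl:value-of select='.'/></fo:block>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    static String transform(String xhtml) {
        try {
            Transformer t = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(XSL)));
            StringWriter out = new StringWriter();
            t.transform(new StreamSource(new StringReader(xhtml)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("Transformation failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(transform("<html><body><p>Hello</p></body></html>"));
    }
}
```

Swapping in a different stylesheet at this stage is what makes the other output formats, and the custom stylesheet option, possible.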
Files destined for plain text or HTML output are first transformed into
XHTML, as above, then via another XSL transformation into the output format.
Plain text output is then post-processed to tidy up the output of the
XSL transformation, which is not designed to produce neat, readable
text.
This multi-stage process sounds complicated and, indeed, it is. But my
intention is that by doing things this way, it will be easier to add
support for other file types in future.
Processing arbitrary XML files

The use of proprietary XML markup for ebook-type documents is rare, but not unheard of.

ebookconverter is able to convert XML to PDF,
plain text, or another XML format, if the program is supplied with a
suitable XSL stylesheet to effect the translation. An example of this
usage may be found in the samples directory of the source
code distribution. To convert to PDF, the transformation should produce
XSL-FO; otherwise it can produce whatever output is desired.
Creating a suitable stylesheet for anything but a trivial XML format is
a demanding undertaking, and one that requires considerable expertise in
XSL and (for PDF output) XSL-FO. Such matters are beyond the scope of this
document.
Bugs and limitations

Those I know about include the following:

FAQ

Q. Why do I see hash (#) signs inside of quote marks?
A. Because the input document was created by a Microsoft application that uses a Microsoft encoding format. Try `-s encoding=Cp1252'.

Q. Why can't I specify source options as well as format options in the profile file?
A. Because the optimal source settings will vary from document to document, while format settings should remain constant for a given display device or page size.

Q. Why are tables squashed into a small part of the page?
A. Probably because there are empty columns in the table. The program isn't (yet) smart enough to adjust column widths dynamically based on their contents, so even an empty column gets its share of the page.

Q. Why does the text document I formatted come out as a solid block of text a hundred pages long?
A. Because there aren't enough formatting cues in the document for this program to recognise where the paragraph divisions are. The program is prepared to consider blank lines, indents, and other things as hints to layout, but if none of these is present, it won't work very well.

Q. Why do I get `out of memory' errors when my system has a squillion gigabytes of RAM?
A. Because the Java Virtual Machine will only use as much memory as you tell it, however much the system has. Use the -Xmx command-line
switch to allocate more memory to the java program.

Q. Why is there no graphical user interface?
A. Because I originally intended ebookconverter to be used for
bulk conversion, controlled by a Linux shell script. A graphical user interface
would be wholly unhelpful here. In any event, for most applications
ebookconverter is trivial to use on the command line. But an
equally important reason is that I hate coding user interfaces, and prefer to
use the command line anyway.

Q. Why does conversion from PDF lose formatting?
A. The PDF format is designed for precise page rendering, unlike HTML which contains what are, at most, hints for the renderer. HTML is easy to convert to other formats because it is clear what bits of text are headings, lists, paragraphs, etc. This is not so for PDF. Instead, the PDF document has to be converted to text by analysing individual characters and assembling them into words and lines. There is no information in the PDF file to indicate the function of a particular piece of text. In short, getting anything other than a stream of words out of a PDF document is exceptionally difficult. This program uses various rules-of-thumb for inferring elementary page layout from the stream of words that come out of the document, but results will vary between documents, according to how they were originally authored. See the section `PDF input documents' above for more details.

Q. Why do I keep getting the message "Line 1 of a paragraph overflows the available area"?
A. This indicates that the PDF renderer is trying to fit a sequence of characters that cannot be split up into a width that cannot accommodate it. This does not usually happen with ordinary text, but is common with long e-mail addresses, and the long strings of `"' and `-' characters that are commonly used to indicate headings in plain text documents. If you can't modify the source document, you could try formatting with a smaller font. But, in the end, if it doesn't fit, it doesn't fit.

Legal stuff

The original part of this program is copyright ©1997 Kevin Boone. This part is distributed according to the GNU Public Licence. In effect, what this means is that you can do whatever you like with it, provided that you acknowledge the original author, and make the source code available.

The embedded software components fop, PDFBox, FontBox, BouncyCastle, and NekoHTML, and some of the open-source components that they themselves depend on, have their own licence conditions, which can be found in the licences directory of the distribution.