txt2jpg

Version 0.1 ©2005 Kevin Boone

What is this?

txt2jpg is a simple Java utility for converting plain text files and relatively simple HTML documents to (possibly long) sequences of JPEG images. I wrote it to make it possible for me to read eBooks on my Archos pocket media players, which can view still images as well as movies. Of course, the image viewers on these media players are intended for browsing photo collections, not eBooks; and converting a plain text document into a few thousand separate JPEG files is an outrageous use of disk space; but it works, after a fashion, when the alternative is nothing at all.

The program takes a text or HTML file as input, and outputs numbered JPEG files. You can then copy these files en masse to your photo viewer. If the specified input file has a name that ends in .html or .htm the file is formatted as HTML; otherwise it is treated as plain text.
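
As a rough illustration of the rule above (this is not taken from the txt2jpg source; the class and method names are hypothetical), the choice between HTML and plain-text handling could be made like this. The sketch assumes the comparison is case-insensitive, which this document does not actually state.

```java
import java.util.Locale;

// Hypothetical helper, for illustration only.
class InputMode {
    /** Returns true if the file name ends in .html or .htm. */
    static boolean isHtml(String fileName) {
        // Assumption: ".HTML" counts as HTML too; the README does not say.
        String lower = fileName.toLowerCase(Locale.ROOT);
        return lower.endsWith(".html") || lower.endsWith(".htm");
    }

    public static void main(String[] args) {
        System.out.println(isHtml("book.html")); // prints "true"
        System.out.println(isHtml("book.txt"));  // prints "false"
    }
}
```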

txt2jpg is a command-line utility. It has no graphical user interface and never will have. It is just not complex enough to merit one. Please note that I have no immediate plans to extend this software to read, or produce, other formats than those it currently supports.

System requirements

Because it's written in Java, this program should run on any machine with a reasonably modern Java runtime (e.g., Sun Microsystems' JDK 1.4.2). This needs to be installed in such a way that you can run the java executable (java.exe on MS Windows) at the command prompt without getting an error message. Check this by getting a command prompt and running:
java -version
If you get a version number, you're up and running. If you get an error message, check your installation. Sorry, but fixing broken Java installations is beyond the scope of this document.

You need to have enough disk space to store the generated JPEG files -- about 5 MB for a novel.

Installation

Download the file txt2jpg.jar and copy it to any convenient directory.

Usage

The basic command to run txt2jpg is:
java -jar /path/to/txt2jpg.jar /path/to/text/file.txt prefix
Replace the /path/to stuff with the full directory name of the place where you've installed txt2jpg.jar, and the full path to the text/HTML file you want to convert. On Windows systems you'll probably have to use \ instead of / as the directory separator. prefix is the text that will form the basis of the generated filenames -- this can be anything your operating system allows. The files will be named prefix0000.jpg, prefix0001.jpg, etc., and will be generated in the directory in which the program is run. By default, the generated images will be 480x272 pixels in size, with a 12-point sans-serif font, with white text on a black background.
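
To make the naming scheme concrete, here is a minimal sketch of how such file names can be built in Java. It is an illustration only, not the program's actual code: the class and method names are hypothetical, and it simply assumes a four-digit, zero-padded page number starting at 0, as the examples above suggest.

```java
import java.util.Locale;

// Hypothetical helper, for illustration only.
class OutputNames {
    /** Builds a name such as "book0000.jpg" from a prefix and a page index. */
    static String pageFileName(String prefix, int page) {
        // %04d pads the page number with zeros to at least four digits.
        return String.format(Locale.ROOT, "%s%04d.jpg", prefix, page);
    }

    public static void main(String[] args) {
        System.out.println(pageFileName("book", 0));  // prints "book0000.jpg"
        System.out.println(pageFileName("book", 12)); // prints "book0012.jpg"
    }
}
```

Zero-padding matters because most photo browsers sort files by name, so padded names keep the pages in reading order.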

You can control the way the conversion is done using command-line switches. For example, to produce images that are 320x240 pixels, do this:

java -jar txt2jpg.jar -w 320 -h 240 file.txt prefix
Here are the other command-line options:
-c, --cscheme colourscheme    Set colourscheme (0-3). In fact, txt2jpg does
                              not use colours at all, as such; the variants
                              correspond to different shades of grey.
-f, --fontsize fontsize       Set font size (8-24, default 12).
                              Strange results will be obtained with bizarre
                              font sizes; usually 12 is the smallest that
                              can readily be read on a handheld device.
-h, --height height           Set image height (default 272).
-p, --preformat               Selects preformatted mode (see below).
                              Has no effect with an HTML file.
-q, --squash                  Use squash mode (see below). 
-s, --serif                   Use serif font. The default is to use the
                              system's default sans-serif font.
-v, --verbose level           Set verbosity level (0-2).
                              Doesn't do much at present.
--version                     Show version.
-w, --width width             Set image width (default 480).
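
For readers curious how the size, font and grey-only settings map onto standard Java classes, the following is a sketch under stated assumptions, not the txt2jpg implementation. The class and method names are hypothetical; the only facts taken from this document are the default size (480x272), the 12-point sans-serif font, and white text on a black background.

```java
import java.awt.Color;
import java.awt.Font;
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import javax.imageio.ImageIO;

// Hypothetical sketch, for illustration only.
class GreyPage {
    /** Creates a single-channel (greyscale) page; new images start out all black. */
    static BufferedImage newPage(int width, int height) {
        return new BufferedImage(width, height, BufferedImage.TYPE_BYTE_GRAY);
    }

    /** Writes the page as a JPEG; returns false if no JPEG writer is available. */
    static boolean write(BufferedImage page, File target) throws IOException {
        return ImageIO.write(page, "jpg", target);
    }

    public static void main(String[] args) throws IOException {
        BufferedImage page = newPage(480, 272);
        Graphics2D g = page.createGraphics();
        g.setColor(Color.WHITE);
        g.setFont(new Font(Font.SANS_SERIF, Font.PLAIN, 12));
        g.drawString("Chapter 1", 10, 20); // x, y of the text baseline
        g.dispose();
        write(page, new File("prefix0000.jpg"));
    }
}
```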

Using with Archos devices

To use this software to read books on Archos media players, I suggest that the easiest thing to do is to make a directory under the player's Pictures directory (called Photos on some units) just for converted text files. You can then connect the device as a USB disk drive, and copy all the .jpg files output from txt2jpg into this one directory. If you then open the photo browser and select the first image in the sequence, you should be able to page forward and back in the text using the right and left keys on the unit.

For best results, I suggest setting the image size using the -w and -h switches to be exactly the same as the device's screen dimensions. You may have to experiment with the font settings to get a readable display -- displays vary from device to device, so it's difficult to give general advice.

HTML support

This section describes what you can expect from this program's HTML support; essentially, not very much. Most HTML tags and entities are not processed at all. If the tag is not recognized, an error message is displayed. If it is recognized but cannot be processed, it either (a) is silently ignored (if skipping the tag won't stop text being output) or (b) produces a warning message (if it is likely that nothing meaningful will be produced). For example, the FRAME tag will produce a warning message, because most likely a document containing such a tag will contain no renderable text.

The following tags are recognized and processed:

P and /P     Treated identically -- both break to a new paragraph, and
             reset the paragraph indent, if any
BR and /BR   Break to a new line; do not reset the paragraph indent, if any
BLOCKQUOTE   Sets the left indent the same as a paragraph indent.
             Reset by a P or H tag, as well as a /BLOCKQUOTE
H1, H2, etc  All headings are rendered in bold, all the same size
B and STRONG Rendered in a bold font
CODE         Rendered in a monospaced font
I and U      Rendered in an oblique font
HR           Draws a horizontal line across the page, and flushes the line
PRE          Selects preformatted mode and a monospaced font (see below)
The following tags are recognized but ignored: HTML HEAD BODY TITLE META STYLE SCRIPT DIV SPAN FONT A DOCTYPE LINK PAGE FRAME FRAMESET CENTER TBODY NOBR, all table tags, all form tags, all script tags

The following entities are recognized:

nbsp         translated to plain space
quot         rendered as a double quote
gt, lt       rendered as a greater-than or less-than sign
copy         rendered as a copyright symbol
amp          rendered as an ampersand
#NNN         the Unicode character NNN is inserted. Whether it is rendered
             or not depends on your JVM font capabilities
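
The entity table above can be sketched as a simple lookup. This is an illustration rather than the program's own code: the class and method names are hypothetical, and it assumes that NNN in #NNN is a decimal code point, which this document does not state explicitly.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch, for illustration only.
class EntityDecoder {
    private static final Map<String, String> NAMED = new HashMap<String, String>();
    static {
        NAMED.put("nbsp", " ");      // translated to a plain space
        NAMED.put("quot", "\"");
        NAMED.put("gt", ">");
        NAMED.put("lt", "<");
        NAMED.put("copy", "\u00A9"); // copyright symbol
        NAMED.put("amp", "&");
    }

    /**
     * Decodes an entity name given without the leading '&' and trailing ';',
     * for example "amp" or "#233". Returns null if it is not recognized.
     */
    static String decode(String name) {
        if (name.startsWith("#")) {
            try {
                // Assumption: the number is decimal.
                int codePoint = Integer.parseInt(name.substring(1));
                return new String(Character.toChars(codePoint));
            } catch (IllegalArgumentException e) {
                return null; // not a number, or not a valid code point
            }
        }
        return NAMED.get(name);
    }
}
```
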
In HTML mode, txt2jpg continues to try to fit as much text into the output image as it can. However, it makes some concessions to layout. So, for example, it will try to avoid having a title as the last line on a page.

Preformatted processing

Preformatted mode is selected if you specify the -p command-line switch, or the PRE tag is encountered in an HTML document. In preformatted mode, txt2jpg tries to respect the internal layout of the document. That is, it will break lines where the source document breaks lines, respect multiple spaces and line breaks, and use a monospaced font so that each character occupies the same screen area. This mode is useful for documents such as scripts, which rely heavily on spacing and line breaks to create the proper layout.

However, txt2jpg will not respect the source document to the extent that text will be lost off the right-hand edge of the page if the line does not fit -- it will still wrap lines that are too long to fit the selected image size.
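
The behaviour described in this section can be sketched as follows. This is an illustrative sketch only: the names are hypothetical, and it assumes the page width can be expressed as a fixed number of character columns (reasonable for a monospaced font), whereas the real program presumably works from the image width and font metrics.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, for illustration only.
class PreformattedWrap {
    /** Keeps the source line breaks, but splits any line longer than maxCols. */
    static List<String> wrap(String text, int maxCols) {
        if (maxCols < 1) {
            throw new IllegalArgumentException("maxCols must be at least 1");
        }
        List<String> out = new ArrayList<String>();
        for (String line : text.split("\n", -1)) {
            if (line.isEmpty()) {
                out.add(""); // blank lines are preserved as they are
                continue;
            }
            // Spaces are kept; an over-long line is cut into maxCols-sized pieces.
            for (int start = 0; start < line.length(); start += maxCols) {
                int end = Math.min(line.length(), start + maxCols);
                out.add(line.substring(start, end));
            }
        }
        return out;
    }
}
```

For example, wrapping the three source lines "ab", "" and "cdefg" at 3 columns yields "ab", "", "cde", "fg".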

Squash mode

`Squash' mode is selected by specifying the -q command-line switch. Its purpose is to attempt to increase the amount of text that fits onto each output image. In squash mode, txt2jpg will still respect formatting information, but it will interpret it in such a way as to minimize unnecessary whitespace.

In text mode and HTML mode, the effect of squash mode is to render paragraph breaks as indents, not blank lines. This is how printed text is usually presented, but not what Web browsers usually do. The following affect HTML documents only:

Squash mode typically reduces the number of pages created by 10%-20%.
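
The difference between the two ways of rendering paragraph breaks can be shown with a small sketch. It is not the program's own code: the names are hypothetical, the two-space indent is an arbitrary choice, and it assumes that in squash mode every paragraph, including the first, is indented.

```java
import java.util.List;

// Hypothetical sketch, for illustration only.
class ParagraphJoiner {
    /** Joins paragraphs with blank lines (normal) or with indents (squash). */
    static String join(List<String> paragraphs, boolean squash) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < paragraphs.size(); i++) {
            if (i > 0) {
                // Normal mode leaves a blank line between paragraphs.
                sb.append(squash ? "\n" : "\n\n");
            }
            if (squash) {
                sb.append("  "); // assumed indent width
            }
            sb.append(paragraphs.get(i));
        }
        return sb.toString();
    }
}
```

Every blank line saved this way is one more line of text on the page, which is where the reduction in page count comes from.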

Limitations

This is a very simple program, and most likely even the simple thing it does will be done wrongly in some circumstances. Before assuming it's broken, however, please bear in mind the following considerations.

1. The goal of this program is to convert text or HTML documents for viewing on a small screen. It is not intended to be a general-purpose HTML renderer, and will not produce the same display as a Web browser even if you set the output image size to, say, 1024x768. txt2jpg tries to fit as much text onto each page as possible, while paying some heed to the formatting information in the HTML tags. It does not, and never will, render tables or images.

2. This program cannot process any kind of formatted text other than very simple HTML, or any kind of word processor document, or any kind of compressed file, or any proprietary format. It is primarily designed to read the plain ASCII text files of the type that form the main body of the Project Gutenberg library. It might handle non-Latin characters if your source file is properly encoded, and your system and Java run-time are properly configured, but don't bet on it. It has incidental support for HTML, because a few eBooks are starting to turn up in HTML format, and because it's easy to convert most things into HTML, but don't expect too much.

3. It's much more difficult than you might think to lay out plain text on a small page. In general, the small page will accommodate fewer words than the original source documents, so the lines will have to be broken in different places. This means that we can't assume that an end-of-line marker in the source file should be treated as an end-of-line in the output -- it might just be an `incidental' end-of-line put in because text editors won't work with lines of arbitrary length. The program assumes that a blank line in the source should be taken as the start of a new paragraph. In all other circumstances, words in the source are put into the output so as to fill each line as fully as possible. This means that some layout may be jumbled. This is not a defect in txt2jpg -- it is the inevitable result of changing the page size in a plain text document. To some degree you can avoid this problem using preformatted mode, but at the expense of generating a larger number of pages.

4. Most handheld devices that can display images have a relatively low limit on the number of files that can be placed into one directory. On the Archos devices, it's 999 files. On my AV500, whose screen size is 480x272, if I use the smallest readable font size (12), most ordinary novels will produce fewer than 999 JPEG files. If you increase the font size, you may have to limit the size of the text document you process, because inevitably you won't get as much text on each page, and the number of files may go over the 999-file limit.

5. The program only outputs monochrome images. You can't highlight text, for example, in colour. The reason for this is that monochrome JPEG files are about half the size of colour JPEG files that just happen to contain mostly monochrome data.
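
The line-filling rule described in item 3 can be sketched as a greedy word wrap. This is a simplified illustration, not the txt2jpg source: the names are hypothetical, and it measures width in characters, whereas a real renderer would measure words in pixels using the selected font.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, for illustration only.
class Reflow {
    /** Returns output lines; an empty string marks a paragraph break. */
    static List<String> reflow(String text, int maxCols) {
        List<String> out = new ArrayList<String>();
        StringBuilder line = new StringBuilder();
        boolean pendingBreak = false;
        for (String src : text.split("\n", -1)) {
            if (src.trim().isEmpty()) {
                // A blank source line ends the current paragraph.
                if (line.length() > 0) {
                    out.add(line.toString());
                    line.setLength(0);
                }
                pendingBreak = !out.isEmpty();
                continue;
            }
            // Any other end-of-line is treated as incidental and ignored.
            for (String word : src.trim().split("\\s+")) {
                if (pendingBreak) {
                    out.add("");
                    pendingBreak = false;
                }
                // Start a new line if this word would not fit on the current one.
                if (line.length() > 0 && line.length() + 1 + word.length() > maxCols) {
                    out.add(line.toString());
                    line.setLength(0);
                }
                if (line.length() > 0) {
                    line.append(' ');
                }
                line.append(word); // a word longer than maxCols gets its own line
            }
        }
        if (line.length() > 0) {
            out.add(line.toString());
        }
        return out;
    }
}
```

With a width of 15 columns, the source lines "the quick", "brown fox", a blank line, and "jumps" come out as "the quick brown", "fox", a paragraph break, and "jumps": the first source line break has been discarded, which is exactly how layout that relies on line breaks gets jumbled.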

Bugs

Legal and copyright

This software is distributed according to the GNU General Public License. In brief, this means that the source code is available (please ask me if you would like it), and you may do whatever you wish with the software so long as the original authors continue to be acknowledged. There is, of course, no warranty of any kind.