Education and science ©1994-2003 Kevin Boone
An object-oriented document preparation scheme


Kevin Boone, School of Computing Science, Middlesex University

Abstract
There is a trend towards increased use of ``what you see is what you get'' word processors and presentation graphics packages for document processing. However, the philosophy that underlies such tools can make them counter-productive for preparing and managing large sets of complex, inter-related documents. In these cases `traditional' document preparation systems based on markup languages may be more useful, as the source document can be subject to various pre- and post-processing operations. This article reviews this philosophy, and describes a prototype system for producing and managing complex document sets in the form of executable programs. Doing so allows both the document content and document processing operations to be specified in one source, leading to great flexibility and increased ease of maintenance.

1 Introduction
The modern trend in document preparation is towards increased use of ``what you see is what you get'' word processors and presentation graphics packages. Although these tools are very straightforward to use -- and to learn to use -- they have severe limitations when applied to `heavyweight' document processing jobs. As an example, consider the job of producing integrated teaching materials for a popular undergraduate course. As a lecturer I may wish to produce:

  • slides or overhead transparencies, for showing in lectures and workshops;
  • a (printed) course handbook containing all the information in the slides, plus more detailed explanations, references, sample exam questions, etc.;
  • a set of World-Wide Web pages containing all the information in the printed course handbook, plus links to other Web sites of interest, software to download, on-line tests and other supporting materials.

It can be seen that these sets of documents contain overlapping sub-sets of the same basic information, but presented in different ways. Here is how I might produce this set of documents using the modern, `WYSIWYG' approach.

    1. Using a presentation graphics program, like Microsoft's PowerPoint, design the slides, including images and drawings where necessary.
    2. Using cut-and-paste operations transfer the text and images from the slides to a word processor document, then manipulate them into a style suitable for the printed course handbook.
    3. Add text and other material to the word processor document, until it is finished. I probably have a very large (100-200 pages) document at this stage.
    4. Convert the handbook into a set of HTML (Hypertext Markup Language) files for use as web pages. Most modern word processors can do this, but I probably have to break the document into smaller documents before conversion to make things more manageable.
    5. Using an HTML editor add material to the Web pages that would not have been relevant in a printed document.

Using this technique in practice, I would probably spend about the same amount of time on `creative' tasks as on `non-creative' ones. `Non-creative' in this context means the manipulation of data in such a way that no new information is added. Such non-creative tasks include the distribution of information between the various different document sets, and conversion of files from one format to another.

However, creation of the material is a relatively minor problem compared to its long-term maintenance. In the scenario above, the job of adding a small section of new information to the course material becomes that of modifying three sets of documents, in three different formats managed by three different software tools. This applies even to trivial modifications such as correcting spelling mistakes. It will often be difficult even to remember which documents have to be modified, quite apart from carrying out the modifications consistently. Because the maintenance is so time consuming, there is considerable pressure not to do it, and the overall standard of the course material degenerates, as different documents become inconsistent with one another.

Even a relatively simple application, for example producing two versions of the same document where one is a shortened version of the other, leads to difficulties in management. When a modification is necessary, both versions have to be modified consistently.

Some of these problems can be overcome by abandoning `WYSIWYG' in favour of the earlier `markup language' approach, combined with various pre- and post-processing filters. The most intractable problem -- producing consistent documents in different final formats (e.g., Adobe PostScript and HTML) -- requires a more assertive strategy. A possibility is the use of a 'meta-markup' language, that produces differently marked up text according to a higher-level specification. The marked-up text can then be processed further into printable or viewable output.

This article attempts to demonstrate that standard programming techniques, and the techniques for the management of large complex programs can equally well be applied to the management of large complex document sets, once we accept the loss of WYSIWYG tools. These techniques will already be familiar to many programmers and computer scientists; it is their application that is novel.

2 Using a markup language
The most popular markup languages are probably TeX (Knuth, 1984) -- including LaTeX -- and HTML. Both use plain ASCII source text, and are therefore easy to generate or filter using software. The general strategy for using markup languages is shown in figure 1.

Figure 1: using a markup language to produce viewable output

When using TeX, for example, the marked-up source is a document containing TeX format codes, and the printable output is (ultimately) usually Adobe PostScript. The markup language processor is the TeX program itself. With HTML, the output is usually viewable, rather than printable. The language processor is often a Web browser.

If a number of different documents contain sub-sets of the same information, management can be made easier by placing all the information in one source document. This source document is then filtered to extract the information relevant for each dependent document. A simple implementation is to use filters before the markup language processor (figure 2).

Figure 2: using a markup language with filters

For example, here is the source for two related Web pages, where one page is intended to be an expanded version of the other.

Hempel's <i>Raven Paradox</i> is a demonstration of the problems that
arise when trying to formalize inductive reasoning using the same techniques
that can be applied to deductive reasoning.
|<p>
|Various solutions have been proposed to the paradox, but the prevailing
|view among philosophers now is that...

The intention is that lines starting with a vertical bar `|' will be included only in the longer document. The filters will strip the lines so indicated from one output file, and strip only the vertical bars from the other. Since this is a plain text file in simple format, it is trivial to write a program to carry out this filtering operation; however, on Unix systems at least there is no need. Unix has a rich set of filter utilities which can be pressed into service for this simple job. For example, the following two commands will generate the `short' and `long' versions of the Web page from the original source, which in this example is called `page1.html'.

grep -v "^|" page1.html > page1short.html
sed -e 's/^|//' page1.html > page1long.html
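On systems without these utilities, the same filter is short to write as a program. The following is a minimal C++ sketch, not part of the original toolset; the names `filter_marked_lines` and `Version` are invented for illustration.

```cpp
#include <sstream>
#include <string>

// Which of the two dependent documents to produce (illustrative names).
enum class Version { Short, Long };

// Returns the text of one dependent document. Lines that start with '|'
// are dropped from the short version; in the long version they are kept
// with the leading '|' removed. All other lines pass through unchanged.
std::string filter_marked_lines(const std::string& source, Version version)
{
    std::istringstream in(source);
    std::string line;
    std::string out;
    while (std::getline(in, line)) {
        const bool marked = !line.empty() && line[0] == '|';
        if (marked && version == Version::Short)
            continue;              // plays the role of the grep command
        if (marked)
            line.erase(0, 1);      // plays the role of the sed command
        out += line;
        out += '\n';
    }
    return out;
}
```

Reading `page1.html` into a string and writing the two results to `page1short.html` and `page1long.html` is left out to keep the sketch short.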

So what has this achieved? The crucial point is this: to ensure that both the `long' and the `short' documents are consistent with one another, it is necessary only to update the source document, and execute two Unix commands. And as these commands can be placed in a script, updating becomes a one-command operation. Even better than the use of a script is to use a `make' file; this will ensure that whenever any source document is modified, exactly the needed operations are applied to bring its dependents up to date. This operation should be very familiar to programmers. For example, the following listing is of a `make' file that applies the two filter operations described above only if the source document was modified since the last `make'.

all: page1long.html page1short.html

page1short.html: page1.html
	grep -v "^|" page1.html > page1short.html

page1long.html: page1.html
	sed -e 's/^|//' page1.html > page1long.html

The use of `make' files becomes increasingly powerful as the number of sources increases, and when the sources share components (e.g., chapters or text sections).

Like programming languages, document markup languages vary in the depth of abstraction they offer. For example, HTML and TeX have little abstraction: they focus on the detailed appearance of individual words and characters. Towards the other end of the scale are systems like LaTeX (Lamport, 1986) that are concerned with document layout and structure more than with typesetting. For day-to-day use high-level markup languages probably offer greater ease of use and a more readable source.

When used with markup languages like TeX and HTML, simple filters like the one described above allow some effects that are difficult to achieve, and even harder to maintain, with WYSIWYG tools. Some examples include:

  • including program listings in documents (such that the document is updated automatically if the program changes);
  • including the output from other programs in a document;
  • including a list of references at the end of a document based on citations in the text;
  • generation of form letters where the information to be included is derived from, say, a database and depends on the recipient.
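As an illustration of the first item in this list, a filter that includes a program listing in a Web page mainly has to protect the characters that HTML treats specially. The following C++ sketch shows one way this might be done; the function names are hypothetical and are not taken from any existing tool.

```cpp
#include <string>

// Escapes the characters that HTML treats specially, so that program
// source can be shown literally inside a <pre> element.
std::string html_escape(const std::string& text)
{
    std::string out;
    for (char c : text) {
        switch (c) {
            case '&': out += "&amp;"; break;
            case '<': out += "&lt;";  break;
            case '>': out += "&gt;";  break;
            default:  out += c;       break;
        }
    }
    return out;
}

// Wraps a program listing for inclusion in a Web page.
std::string listing_to_html(const std::string& program_source)
{
    return "<pre>\n" + html_escape(program_source) + "</pre>\n";
}
```

If a `make' rule runs such a filter whenever the program file changes, the listing in the document cannot drift out of date.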

3 The ADG system
Although it is quite contrary to modern trends in document preparation, an increase in flexibility and ease of maintenance can be achieved using a programming language to specify the document content. If necessary, language statements can carry out processing operations in the same source, thus combining the processing and the specification of text. The ability to produce on-line viewable and printable documents from the same source can be improved using a `meta-markup' language. ADG is a prototype document preparation and management system based on these philosophies.

3.1 Overview
ADG (all-purpose document generator) is a prototype system for specifying (and perhaps processing) documents in C++, an object-oriented programming language. In its current form, ADG produces two types of output: TeX (and thereby PostScript for printing) and HTML (for Web pages). In its simplest mode of operation, it will take a marked-up source document and produce a TeX document and a corresponding HTML version. Of course the facilities offered by TeX and HTML are not identical, so exact equivalence between the two output formats is impossible; however, they are close enough for most purposes. In a sense ADG is a meta-markup language processor, as its input and output are both markup languages. However, as it is based on a `real' programming language, it is extremely flexible. In particular, it is able to carry out the job described in the introduction, that is, the production and maintenance of several types of teaching materials in several different formats from one source.

The use of a programming language for producing documents is not new; indeed TeX and PostScript have some characteristics of programming languages, such as variables and functions. However, C++ is a true, general-purpose language, with all the flexibility and power this entails. Being object-oriented, C++ is an ideal vehicle for specifying documents which contain `entities' such as figures, tables and lists. It is very easy to create sub-types of the built-in classes to provide entities with behaviour different to the default.

3.2 ADG components
The ADG system consists of a syntax converter, and a set of C++ class libraries. There is one class library for each output format, but in fact most of the class library does not depend on the output format, and only a small proportion of the code in the library needs to be re-written if new formats are added. The generation of the output files occurs when the ADG document is executed (as a C++ program), so a C++ compiler is also required. The steps in generation of documents from an ADG source are shown in figure 3. Of course, utilities are provided to automate this process and make it transparent to the user. However, it is envisaged that in a substantial documentation project the author would create `make' files to automate these tasks. This has the advantage of being faster -- when properly configured -- as only the essential processes are applied to each component at any time.

Figure 3: steps in the ADG process

The syntax converter is used to overcome the fact that C++ is not a naturally expressive language for text management. For example, the following line shows how italic and bold formatting is applied to sections of text in an ADG document: 

@italic{this section is italicized} and @bold{this section is bold}.

 

This is translated by the syntax converter into `real' C++ code, which is what the author would otherwise have to enter:  

italic((string)"this section is italicized")+(string)" and "+bold((string)"this section is bold")+(string)"."

 

In an ADG document the syntax converter is enabled for text between pairs of dollar signs. Outside of these regions, any normal C++ statements can be used. Of course, the functions `italic' and 'bold' in the example above have to be supplied from somewhere; this is the job of the class library.
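To make the translation step concrete, here is a minimal sketch of how such a converter might work on the text between one pair of dollar signs. It is an illustration only and is not ADG's actual converter: the function names are invented, commands are assumed to have the form `@name{...}` with an alphabetic name, and there is no error handling for unbalanced braces.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

namespace {

// Quotes a run of plain text as a C++ string expression, escaping
// backslashes and double quotes so that the result compiles.
std::string quote(const std::string& text)
{
    std::string out = "(string)\"";
    for (char c : text) {
        if (c == '"' || c == '\\') out += '\\';
        out += c;
    }
    return out + "\"";
}

// Parses from 'pos' to the end of the text or, when 'inside_braces' is
// set, to the closing '}' of the current command (which is consumed).
std::string parse(const std::string& text, std::size_t& pos, bool inside_braces)
{
    std::string expr;    // expression built so far
    std::string plain;   // plain text not yet emitted

    auto append = [&expr](const std::string& part) {
        if (!expr.empty()) expr += "+";
        expr += part;
    };
    auto flush = [&]() {
        if (!plain.empty()) { append(quote(plain)); plain.clear(); }
    };

    while (pos < text.size()) {
        const char c = text[pos];
        if (c == '}' && inside_braces) { ++pos; break; }
        if (c == '@') {
            // Scan the command name and require '{' straight after it.
            std::size_t end = pos + 1;
            while (end < text.size() &&
                   std::isalpha(static_cast<unsigned char>(text[end])))
                ++end;
            if (end > pos + 1 && end < text.size() && text[end] == '{') {
                flush();
                const std::string name = text.substr(pos + 1, end - pos - 1);
                pos = end + 1;
                append(name + "(" + parse(text, pos, true) + ")");
                continue;
            }
        }
        plain += c;      // anything else is ordinary text
        ++pos;
    }
    flush();
    return expr.empty() ? quote("") : expr;
}

} // namespace

// Converts marked-up text into a C++ expression of function calls and
// string literals joined by '+'.
std::string convert_markup(const std::string& text)
{
    std::size_t pos = 0;
    return parse(text, pos, false);
}
```

Because the parser calls itself for the text inside each pair of braces, nested commands such as `@bold{@italic{x}}` come out as nested function calls.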

After syntax conversion and compilation, the class library for the desired output format is linked in. At present, the class library contains classes for different types of document, tables, figures, bulleted and numbered lists, text styles, tables of contents and verbatim program listings.
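The effect of choosing a class library can be pictured with the following sketch, in which two namespaces stand in for the two libraries. This is an assumption made for illustration: in ADG the choice is made at link time, and the namespace names and the exact markup emitted here are not taken from the ADG sources.

```cpp
#include <string>

using std::string;

// Stand-in for the TeX class library: plain TeX font switches.
namespace tex_library {
    string italic(const string& s) { return "{\\it " + s + "}"; }
    string bold(const string& s)   { return "{\\bf " + s + "}"; }
}

// Stand-in for the HTML class library: the same functions, HTML tags.
namespace html_library {
    string italic(const string& s) { return "<i>" + s + "</i>"; }
    string bold(const string& s)   { return "<b>" + s + "</b>"; }
}
```

The converted expression is the same in both cases; only the functions it calls differ. With `html_library` in scope, `italic((string)"a")+(string)" and "+bold((string)"b")` evaluates to `<i>a</i> and <b>b</b>`.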

3.3 A simple ADG document
The example below shows a very simple ADG document.

int main()
{
Document D;
D.add ($
Here is some plain text. @italic{Here is some italicized text}.
$);
}

Note the `add' operation on the Document object `D'. This operation is overloaded to accept strings (e.g., plain text) and other objects such as figures and tables. In implementation, figures, tables and other similar entities are all derived from an abstract base class `DocObject' which provides the features that all these things have in common, such as captions, and some spacing above and below to improve layout.
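A minimal sketch of this arrangement is given below, for HTML output only. `Document' and `DocObject' are named in the text, but every member shown here (`render`, `body`, `content`), the `Figure` constructor arguments and the markup produced are assumptions made for illustration.

```cpp
#include <string>

// Abstract base for figures, tables and similar entities.
class DocObject {
public:
    explicit DocObject(const std::string& caption) : caption_(caption) {}
    virtual ~DocObject() {}

    // Shared behaviour: spacing and a caption around the entity's body.
    std::string render() const
    {
        return "<p>\n" + body() + "<br><i>" + caption_ + "</i>\n</p>\n";
    }

protected:
    virtual std::string body() const = 0;   // supplied by each entity type

private:
    std::string caption_;
};

class Figure : public DocObject {
public:
    Figure(const std::string& image_file, const std::string& caption)
        : DocObject(caption), image_file_(image_file) {}

protected:
    std::string body() const override
    {
        return "<img src=\"" + image_file_ + "\">\n";
    }

private:
    std::string image_file_;
};

class Document {
public:
    // Overloaded 'add': one version for text, one for entities.
    void add(const std::string& text)  { content_ += text; }
    void add(const DocObject& object)  { content_ += object.render(); }

    const std::string& content() const { return content_; }

private:
    std::string content_;
};
```

A new kind of entity then needs only to derive from `DocObject` and supply `body()`; the caption and spacing come from the base class.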

3.4 ADG Document classes
ADG provides several built-in document classes, and it is easy for the end user to extend them or introduce new ones using standard C++ constructs. The basic, plain document is a class called `Document', as used in the example above. This is specialized into, for example, `DocumentWithFrontMatter' which has a title page and table of contents. `BookDocument' is a type of DocumentWithFrontMatter, but is designed to be printed double-sided, so it has alternating page numbers and a binding offset. All these features are generated through virtual operations, so a new derived class can implement its own variants.
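The role of the virtual operations can be sketched as follows. The class names `Document' and `DocumentWithFrontMatter' come from the text; the members, the HTML produced and the user-defined class `DraftDocument` are hypothetical.

```cpp
#include <string>

class Document {
public:
    virtual ~Document() {}

    void add(const std::string& text) { body_ += text; }

    // Assembles the output; the virtual hook lets derived classes vary it.
    std::string generate() const { return front_matter() + body_; }

protected:
    virtual std::string front_matter() const { return std::string(); }

private:
    std::string body_;
};

class DocumentWithFrontMatter : public Document {
public:
    explicit DocumentWithFrontMatter(const std::string& title)
        : title_(title) {}

protected:
    const std::string& title() const { return title_; }

    std::string front_matter() const override
    {
        return "<h1>" + title_ + "</h1>\n";
    }

private:
    std::string title_;
};

// A user-defined variant: the same document with a different title page.
class DraftDocument : public DocumentWithFrontMatter {
public:
    explicit DraftDocument(const std::string& title)
        : DocumentWithFrontMatter(title) {}

protected:
    std::string front_matter() const override
    {
        return "<h1>DRAFT: " + title() + "</h1>\n";
    }
};
```

The user's class overrides one operation and inherits everything else, so the built-in classes do not have to be modified.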

3.5 ADG entities
The basic ADG document entities are tables, bulleted and unbulleted lists, figures, and verbatim listings. These are all derived from an abstract base class which controls the routine operations that are common to all these entities. Entities receive their numbers from the Document object, which keeps a count of the number of entities of each type that have been added. This means that each entity must have a name, e.g., `figure' or `listing', so that the numbering of these various entities proceeds in the correct sequence. The Document object keeps track of the numbers by maintaining a count of the number of times each name has been supplied by an entity. This is not very elegant, but there did not seem to be a better way that did not compromise the encapsulation of objects. Fortunately it is all invisible to the casual user. Provided that new document entities are derived from the existing abstract base class, it is mostly invisible to more advanced users as well.
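The counting scheme described above amounts to a table of counters indexed by entity name. A minimal sketch follows; the member name `next_number` is invented, and the rest of the Document class is omitted.

```cpp
#include <map>
#include <string>

class Document {
public:
    // Returns the next number for entities with the given name.
    // Each name is counted independently, starting from 1.
    int next_number(const std::string& entity_name)
    {
        return ++counts_[entity_name];   // a new name starts at zero
    }

private:
    std::map<std::string, int> counts_;
};
```

An entity would call this once, passing its own name, when it is added to the document. The Document therefore needs no knowledge of which entity types exist.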

3.6 Advantages of ADG
It is as easy to specify a document in ADG as it is in TeX or HTML -- or easier -- but it provides the power and flexibility of a compiled, object-oriented language. As documents can be thought of as collections of objects, some of which are aggregates of other objects or sub-types of them, the object-oriented approach is highly appropriate. Object orientation makes it very easy for the end user to extend the system without causing interference to built-in features.

There are other ways to get both printed and on-line output from a single document; for example, there are various filters which attempt to convert LaTeX to HTML and vice-versa. These work with varying degrees of success. They tend to resort to producing bit-maps for features that cannot be directly rendered by the target system, which is wasteful of computer resources and slow when on-line. ADG recognizes the spirit of document formatting: it produces output which may not be exactly what the author intended (e.g., 28.5pt font) on all targets, but will be close enough for most purposes and as efficient as possible.

3.7 Further work
The main limitation of ADG at present is its need to produce printed output via TeX, rather than directly to PostScript, for example. The reason for this was simply to avoid re-implementing all the things that TeX does rather well, like line-breaking and page-breaking. However, ADG has no access to page numbers (as these are generated by TeX after ADG has finished), making it difficult to give correct page numbers in tables of contents and cross-references. The solutions currently adopted are too inelegant to discuss. A future goal should be the ability to produce PostScript directly.

Eventually it would be useful to implement a graphical front end to ADG.

4 Summary
There are many good reasons to use `traditional' text processing techniques based on markup languages (Knuth, 1984). Not least among these is the ability to manipulate the source under the control of a program. By doing so it is often possible to arrange for a set of related documents to derive from one source, making maintenance much easier. ADG takes this a step further, by implementing a document specification scheme in a true compiled programming language. This makes it possible to combine the specification and the processing of the text in one source, leading to greatly increased flexibility.

Acknowledgements
I would like to thank Dr Howard Goodman for his assistance with the finer points of the TeX language.

References
Knuth D (1984) ``The TeXbook'', Addison-Wesley, New York

Lamport L (1986) ``LaTeX: a Document Preparation System'', Addison-Wesley, New York