
NOTE TO TOOL OWNERS: In this blog I will occasionally make statements about products that you will take exception to. My intent is always to be factual and accurate. If I have made a statement that you consider to be incorrect or inaccurate, please bring it to my attention and, once I have verified my error, I will post the appropriate correction.

And before you get too exercised, please read the post, dated 9 Feb 2006, titled "All Tools Suck".

Sunday, July 23, 2006

Composition Call To Arms

While I've spent most of my career thinking about XML content management and link management in one way or another, for the last five years or so my focus has been primarily on automated composition of XML documents. For most of that time I've worked on applying XSL-FO to challenges of composing technical documents written in a large number of different national languages (along with some of the authoring and data representation challenges that go along with it).

For the last year or so I've been working on a new Innodata Isogen initiative called the Tools Agnostic Layout System (TALS), whose primary goal is to provide a generic page layout style mechanism that can then be used to generate renderers for a variety of back-end composition systems. We are not implementing composition functionality directly; rather, we're adding another layer of abstraction above what you get with tools like XSL-FO or commercial XML-aware composition systems.

Our original target was reduction of the engineering cost of creating XSL-FO generation systems using XSLT, while also formalizing the practice and reducing the cost of capturing detailed formatting requirements. However, driven by our pilot customer, our focus changed from XSL-FO to automation, as much as possible, of the composition of high-end documents such as textbooks.

In order to do this I've been implementing a "renderer generator" that takes our style sheets (what we call "master format specifications") and generates from them XSLT programs that then generate files for a specific composition engine.

This engine, which I will not name, provides powerful composition features required to do things like textbooks, features like vertical justification and sophisticated hyphenation control (which in turn requires sophisticated line layout algorithms). It provides features for intelligent placement of floated objects.

This engine is also a very old piece of technology that reflects a time before object-oriented methods or syntaxes with more than 8-letter keywords. It is poorly documented. It is clearly a decades-old accretion of features with little overarching consistency of design or convention. It has many annoying bugs and implementation shortcomings that speak to weak overall engineering. Its control language syntax is obtuse and impenetrable.

Nevertheless, it is one of only two or three products that can do what we need to do. It has a lot of market penetration and a large install base. So we are using it.

But as a side effect of this experience, I've come to realize that the world needs something that does not today exist (at least I haven't seen any hints of it): a modern, open-source, well-engineered, full-featured composition engine.

The business problem I'm seeing is that high-quality largely-automated composition is a functionality that many enterprises need, obviously mostly large publishers, but the cost of using the existing tools is prohibitive, both in terms of the raw license costs and in terms of the labor cost needed to use those tools.

High-quality composition seems to be one of the last major areas where there is no good open-source solution that can be integrated into the rest of the XML support infrastructure.

It's no surprise why this is: it's a very hard problem, there are existing tools, the primary user enterprises are, by their nature, not quick to seek new processes when old ones work well enough, and composition has never been such a big fraction of the cost of doing books or magazines to make finding a less-expensive solution that compelling. But I think that is changing in large part because of the disruptive effect that the Internet itself is having on the book publishing world.

I think that the time is right for the development of an object-oriented, general-purpose, open-source composition engine designed from the ground up to satisfy the typesetting requirements of the most demanding documents.

For the purposes of this call to arms I will focus on textbooks as the target application, in particular high school and college textbooks. That is because textbooks appear to present the most challenging requirements within the set of requirements that can be met (or mostly met) using automated composition from XML. Some very heavily designed documents, such as magazines or marketing material, are too idiosyncratic to be practically composed automatically--they require too much hand work. But textbooks can be automated close to 100% in most cases (depending of course on design choices).

What I have is a pretty deep knowledge of these requirements and the issues inherent in automating them. For example, many textbooks use sidebars and similar marginal material. These present a serious automation challenge because they must be positioned relative to each other using various algorithms and rules of thumb that invariably involve some aesthetic choices ("space them equally on the page vertically unless they are too tight and then do blah blah blah").
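
To make the flavor of such a rule concrete, here is a minimal Java sketch of just the first half of that rule of thumb: spread the sidebars evenly down the margin, and signal the caller when the even gap would be too tight. The class name, the method name, and the `minGap` parameter are all hypothetical, invented for illustration; a real engine would need many more inputs (anchors, column breaks, aesthetic preferences).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

final class SidebarSpacingSketch {

    /**
     * Returns the top offset of each sidebar when the sidebars are spread evenly
     * down a margin of the given height, or Optional.empty() when the even gap
     * would be smaller than minGap (the "too tight" case, where some other rule
     * has to take over). All units are arbitrary but must be consistent.
     */
    static Optional<List<Double>> spaceEvenly(double marginHeight, double[] heights, double minGap) {
        double used = 0;
        for (double h : heights) {
            used += h;
        }
        // One gap above the first sidebar, one between each pair, one below the last.
        int gapCount = heights.length + 1;
        double gap = (marginHeight - used) / gapCount;
        if (gap < minGap) {
            return Optional.empty(); // too tight: the fallback rule is not modeled here
        }
        List<Double> tops = new ArrayList<>();
        double y = gap;
        for (double h : heights) {
            tops.add(y);
            y += h + gap;
        }
        return Optional.of(tops);
    }

    public static void main(String[] args) {
        System.out.println(spaceEvenly(600.0, new double[] {100.0, 200.0}, 12.0));
        System.out.println(spaceEvenly(600.0, new double[] {290.0, 290.0}, 12.0));
    }
}
```

The hard part is everything hidden behind the empty result: which sidebar gives way, whether one moves to another page, and how that decision feeds back into the main text flow.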

I also know that these problems are solvable to one degree or another because software exists in various forms that can do it.

I also know that both software engineering techniques and tools as well as computers themselves have improved dramatically in the 30 years since many of these tools or their underpinnings (usually TeX) were first developed. This means that many things that we would now identify as premature optimizations, but that at the time were simply the only way to make it work on affordable (or even any available) hardware, are no longer necessary, at least not in the first iteration. I know for example that the guys at RenderX built an FO-capable composition system from the ground up in a reasonable amount of time. They're smart guys but not super geniuses (at least not as far as I know) so if they can do it we should be able to too.

I think that, given a reasonable amount of design work, it should be possible to design a composition system architecture that will satisfy the composition requirements of textbooks. I think that, given the architecture, implementation should be fairly straightforward, although it will not always be easy because there will assuredly be lots of wrinkles that can't be anticipated in the design. But we know what the business problem is, we know what the result needs to be, and the basic techniques of automatic typesetting are well established and well documented, thanks to Dr. Knuth. In addition, over the years a number of libraries have become available that handle a lot of the details, such as working with fonts and font metrics, doing line layout, rendering vector graphics, and so on. That should all reduce the total engineering cost considerably over what it would have been even just five years ago.

So it should be doable.

Note that this type of activity is not one where you can start off simply and iterate your way towards greater sophistication (as you can, for example, with XML-aware content management). This is because the software architecture and implementation techniques that will solve the hardest problems will differ quite markedly from solutions that will satisfy less demanding requirements. This is clear from looking at existing XSL-FO solutions. XSL-FO's abstract architecture is explicitly defined so that it avoids the hardest composition problems, those that require feedback during the pagination process into the initial layout process. Thus XSL-FO systems can be significantly simpler in this area than more complete composition systems.

So the only way to really proceed is to start with the hardest problems and develop a solution for those, figuring all the rest of the details will work themselves out in the wash.

In addition, we can assume that a lot of the data processing that currently complicates XML-aware systems, such as generating tables of contents, reordering content, generating text and other decorations, rendering indexes, and so on, will all be handled in a separate pre-process phase such that the input to the composition engine should reflect the linear structure of the data as it will be laid out, lacking only those things that cannot be known in advance of doing layout and pagination. That also significantly simplifies the problem.

The system should expose all of its functionality through an API as well as a standard character-based input format (i.e., an XML syntax which would be most logically based on XSL-FO and extended where appropriate to reflect features XSL-FO does not provide).

It should be implemented in either Java or .NET. My preference is Java but if the architecture is correct it shouldn't really matter since the hard part is the algorithms, not the code writing.

The initial implementation effort should focus on completeness of functionality, not performance optimization. Once you get the data structures and algorithms right, then you can work on optimizing it (or commercial concerns can add value worth paying for by optimizing it, as we've seen in the XSL-FO world).

I think it's doable and I would love to have the opportunity to participate but it's certainly not something I can do by myself (I am no James Clark or Mike Kay). If I can think of a catchy name I might start a SourceForge project for it.

And in the hopes that a puzzle might motivate someone to sign up for such a project, here's what I think is the essential challenge:

We have in the input data elements A and B. The semantics of A and B map to formatting rules that say that if A occurs on a left-hand page, B must be presented on the right-hand page following it, but if A occurs on a right-hand page, B may be presented on either the preceding left-hand page or on the following right-hand page, depending on where B occurred in the original source data relative to A. As it happens, B precedes A in the input source. The first stage of the rendering process renders B in its initial location and then renders A. This happens to land A on a left-hand page. On the next phase, B's rendition is moved so that it is placed on the right-hand page following A. This frees up enough space so that A moves back one page to the preceding right-hand page. This now allows B to be moved back to its original location, which pushes A to a left-hand page....

How do you resolve this potentially infinite loop? This type of problem is endemic in documents where there are lots of out-of-line elements with complex rules about how they can be placed both relative to each other and relative to the ordinality of the pages they fall on.
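
As a starting point for discussion only, one simple way to at least keep the loop from running forever is to remember every layout state already visited, stop when a state repeats, and then pick the least objectionable state seen. The Java sketch below models the A/B puzzle as a toy: the `State` class, the even-pages-are-left-hand convention, and the penalty values are all assumptions made up for illustration, and nothing here claims to be how any existing engine resolves such conflicts.

```java
import java.util.HashSet;
import java.util.Objects;
import java.util.Set;

final class FeedbackLoopSketch {

    /** Toy layout state: the page A lands on, and whether B has been moved after A. */
    static final class State {
        final int pageOfA;
        final boolean bMovedAfterA;

        State(int pageOfA, boolean bMovedAfterA) {
            this.pageOfA = pageOfA;
            this.bMovedAfterA = bMovedAfterA;
        }

        @Override public boolean equals(Object o) {
            if (!(o instanceof State)) return false;
            State s = (State) o;
            return pageOfA == s.pageOfA && bMovedAfterA == s.bMovedAfterA;
        }

        @Override public int hashCode() {
            return Objects.hash(pageOfA, bMovedAfterA);
        }

        @Override public String toString() {
            return "A on page " + pageOfA + ", B " + (bMovedAfterA ? "moved after A" : "in source position");
        }
    }

    // Assumption: even page numbers are left-hand pages.
    static boolean isLeftHand(int page) {
        return page % 2 == 0;
    }

    /** One feedback pass of the toy rules from the puzzle. */
    static State step(State s) {
        if (!s.bMovedAfterA && isLeftHand(s.pageOfA)) {
            // A is on a left-hand page: B moves after A, which frees space and pulls A back a page.
            return new State(s.pageOfA - 1, true);
        }
        if (s.bMovedAfterA && !isLeftHand(s.pageOfA)) {
            // A is on a right-hand page: B returns to its source position, pushing A forward a page.
            return new State(s.pageOfA + 1, false);
        }
        return s; // nothing left to fix
    }

    /** Hypothetical rule priorities: breaking the "must" rule costs more than breaking source order. */
    static int penalty(State s) {
        if (!s.bMovedAfterA && isLeftHand(s.pageOfA)) return 10;
        if (s.bMovedAfterA && !isLeftHand(s.pageOfA)) return 1;
        return 0;
    }

    /** Iterates until a state repeats (or maxPasses is hit) and returns the cheapest state visited. */
    static State resolve(State start, int maxPasses) {
        Set<State> seen = new HashSet<>();
        State best = start;
        State current = start;
        for (int pass = 0; pass < maxPasses && seen.add(current); pass++) {
            if (penalty(current) < penalty(best)) {
                best = current;
            }
            current = step(current);
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(resolve(new State(4, false), 100));
    }
}
```

Cycle detection only guarantees termination. The interesting design question is the penalty function, that is, how the competing placement rules are ranked against each other, and whether the state space in a real document stays small enough to remember.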

And of course there are other similar challenges, such as tables that span pages both horizontally and vertically or the layout of footnotes that span columns or pages or vertical justification of text on the page or within a column or within a table.

These all involve feedback and the application of numerous rules with the attendant need to prioritize and resolve the rules dynamically and do it in a reasonable amount of time.

There must be a general architectural and algorithmic approach that addresses these requirements. I'm sure there is an established body of knowledge of how to address these general problems, which must be general problems in computer science beyond page composition. It's just a matter of putting it all together in the context of a well-engineered implementation.

How hard could it be?


2 Comments:

Anonymous said...

If this project can use an experienced diaper changer with a smattering of Java and XML skills, then count me in for a few hours :-).

On the subject of demanding use cases, how about generating change pages? This is needed for large publications that have frequent small revisions, where the cost of republishing the entire manual is prohibitive.

If the input file to the composition system contained the content of the original document plus markup showing what was added or removed during subsequent revisions, the system should be able to identify all the pages that changed and print just those pages. Also, at the author's discretion, the system should be able to insert alpha-numbered pages or blank pages to ensure that the minimum number of pages needs to be included with any change package.

Some would argue that this should be done as a preprocess to a composition system, but my opinion is that this kind of thing needs to happen in an application that knows what a page is, i.e. the composition system itself.

Oh, and how about FOTeX for the name? It shows that XSL-FO and TeX are influences, and it sounds like Faux-TeX.

Rick Geimer

10:15 PM

Eliot Kimber said...

Hmm, I'm not sure FOTeX is quite what I had in mind for a name....

As for change pages, I agree that that is an important requirement, although I don't think your suggested approach will work, for the simple reason that the order of the source markup does not always agree with the order of presentation, sometimes dramatically.

Rather, I think you have to have a way to associate the original source elements with their eventual rendered locations using some form of mapping that can then be used as input into a second processing run.

That is, as far as I've thought the problem through, you need to know what pages a given input structure fell on or occupied in a given rendition instance. Given that knowledge, as well as knowledge of the changes between the version of the source that produced the first publication and the version of the source that produced the second, you should be able to determine what the pagination differences are. Given that knowledge you can then produce the appropriate point pages or blank pages or whatever.

So to some degree, I see point pages as being as much a problem of version management (which means a problem of XML-aware content management) as a problem of composition.

The only real composition aspect is designing the composition system so that it can emit or otherwise capture the element-to-rendered-page information. Given that information the actual paging or page numbering should be straightforward given a little flexibility in how you can do page numbering in the composition engine.
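
As a rough illustration of what that element-to-rendered-page information might enable, here is a minimal Java sketch. It assumes a hypothetical capture format, a map from source element ID to the set of page numbers the element occupied, one map per rendition, plus a set of IDs whose content changed between source versions. The class and method names are invented, and elements deleted from the second version are deliberately ignored to keep the sketch short.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

final class ChangePageSketch {

    /**
     * Returns the pages of the new rendition that would need to be reissued:
     * every page occupied by an element whose content changed, plus every page
     * occupied by an element whose set of pages differs from the old rendition
     * (including elements that did not exist in the old rendition).
     */
    static Set<Integer> pagesToReissue(Map<String, Set<Integer>> oldPages,
                                       Map<String, Set<Integer>> newPages,
                                       Set<String> changedElements) {
        Set<Integer> result = new TreeSet<>();
        for (Map.Entry<String, Set<Integer>> entry : newPages.entrySet()) {
            Set<Integer> before = oldPages.get(entry.getKey()); // null when the element is new
            boolean moved = !entry.getValue().equals(before);
            if (moved || changedElements.contains(entry.getKey())) {
                result.addAll(entry.getValue());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Set<Integer>> oldPages = Map.of("p1", Set.of(1), "p2", Set.of(1, 2), "p3", Set.of(3));
        Map<String, Set<Integer>> newPages = Map.of("p1", Set.of(1), "p2", Set.of(1, 2), "p3", Set.of(3));
        System.out.println(pagesToReissue(oldPages, newPages, Set.of("p2")));
    }
}
```

Deciding where point pages or blank pages go is then a policy layered on top of this set, which is where the version-management side of the problem comes in.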

But it's definitely a requirement that will require more thought.

Also, I think change pages can be seen as a special case of the more general problem of the need for feedback from the pagination result to the initial XML-to-composition-input transform, which you need to have in order to do any number of layout-aware formatting things.

6:13 AM  
