Dr. Macro's XML Rants

NOTE TO TOOL OWNERS: In this blog I will occasionally make statements about products that you will take exception to. My intent is to always be factual and accurate. If I have made a statement that you consider to be incorrect or inaccurate, please bring it to my attention and, once I have verified my error, I will post the appropriate correction.

And before you get too exercised, please read the post, dated 9 Feb 2006, titled "All Tools Suck".

Monday, July 24, 2006


So far I've talked about XML-aware content management from the point of view of the functions and services such a system should provide and how it should be implemented. But it's possibly even more useful to think about how and why such systems are used.

The question then is: why do you use an XML-aware content management system?

First, let me be clear that this is all in the context of document authoring, not delivery. Delivery, while it can use the same underlying technologies, is a completely different kettle of fish where different requirements come into play. In general, delivery is simpler in terms of business logic but more demanding in terms of scale and performance. That is, an individual author can easily create a set of HTML pages on their crappy old laptop managed by CVS on some hagged out old PC, but serving those pages to a million users requires the beefiest of servers.

I'm not sure I've made that distinction completely clear so far in this discussion.

In addition, I always assume that the authoring repository and the publishing/delivery repository are physically separate and connected through a publishing process such that data from the authoring repository is processed in some way and the results are loaded into the delivery repository from which they are then made available to their intended users/readers/consumers.

One thing this means is that there is always a processing step between the data as authored and the data as delivered that is a key opportunity to apply processing. In the simple case this process is a null operation--you're just copying files. But in most real world processes you're doing something, even if it's just stripping out internal comments from the XML code.
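To make that "just stripping out internal comments" step concrete, here is a minimal sketch in Python using the standard library's minidom. The function name `strip_comments` and the sample markup are made up for illustration; in my own work this kind of thing is done with XSLT, so treat this as one possible shape for the step rather than a recommendation.

```python
from xml.dom import Node, minidom


def strip_comments(xml_text: str) -> str:
    """Return the document element of xml_text with all XML comments removed."""
    doc = minidom.parseString(xml_text)

    def walk(node):
        # Iterate over a copy, because children are removed during the walk.
        for child in list(node.childNodes):
            if child.nodeType == Node.COMMENT_NODE:
                node.removeChild(child)
            else:
                walk(child)

    walk(doc)
    return doc.documentElement.toxml()


# Hypothetical authored source containing internal comments.
authored = "<doc><!-- internal: check this --><p>Hello<!-- todo --> world</p></doc>"
delivered = strip_comments(authored)
```

The point is only that the copy from authoring repository to delivery repository is a natural place to hang this kind of processing, however small.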

The main reasons for using (or thinking you'll need to use) an XML-aware CMS for authoring support include:

- Versioned management of document source files

- Distributed access to document source files stored in a central location (i.e., files not scattered about on local computers)

- Control of access to specific files by specific users

As we've seen, these three uses can be well satisfied by Subversion or similar version control systems (CVS, etc.). For a lot of use cases this is all you need, even if you're doing more sophisticated things like cross-file linking or re-use. In my own day-to-day work this is how I operate. This works because the scale and scope of the linking I'm doing is small enough that it can be managed just through a little care, clear file organization and naming conventions, and rendition processors that do the link resolution and validation.

- Support for re-use of XML content at relatively fine levels of granularity

- Support for publishing of documents that use link-based re-use and hyperlinking (i.e., link resolution services)

- Full-text search and retrieval of XML content in support of authors doing re-use or re-purposing of content

- Management of the creation, maintenance, and resolution of hyperlinks for documents under active development

- Management of documents through development workflows

- Application of descriptive taxonomies to bodies of information to assist with classification, navigation, and retrieval of that information (i.e., Topic Maps)

- Support for large scale/large-volume publishing (for example, publishing all the manuals for an enterprise's products to the Web as HTML and PDF)

- Long-term archival storage of document source for documents that may need to be revised and republished long after they are originally authored.

- Management of different language versions of localized documents in order to support localization activities

Of these, most enterprises using XML for technical document authoring have a strong re-use requirement. Enterprises that use XML for things like textbooks or trade technical documents may have a less obvious re-use requirement but even in these cases there are pretty strong uses for re-use. For example, in textbooks there is often the main textbook as well as an instructor's manual and a lab book. These three publications likely share (or could share) quite a lot of their content. Any textbook that sells will also go through a number of revisions over its lifetime--using material from published version 1 in published version 2 without change is also re-use. Textbook material may also be re-used in different publications or re-purposed in customized versions of textbooks. You get the idea.

So in general, even if you started using XML just to get some better control over the rendering aspects of documentation you will quickly realize that there are significant opportunities for re-use regardless of what your documents are about or who consumes them.

Therefore I like to focus on re-use as a key user requirement that moves us from being able to use simple versioning systems to really needing more sophisticated XML content management systems that support the needs of re-use directly. In addition, if you do re-use the way I think it should be done, that is, using hyperlinks and not external parsed entities, then there is a requirement for link management that, once satisfied, will also satisfy the requirements implicit for using links for navigation.

I also focus on re-use because most enterprises do much less linking than they thought they would and more re-use. For example, in a lot of technical manuals the navigation linking is limited mostly to cross-references within the document or links to entire separate publications. In addition, the use of re-use tends to make navigation linking harder (for reasons I'll get into before too long), so management and/or authors make the conscious or unconscious decision to limit the amount of linking they do.

I'll talk about all that in more detail, rest assured.

For the other requirements, I'll make a few brief remarks:

- Support for publishing of documents that use link-based re-use and hyperlinks

If a publication is represented as a system of linked XML documents and their attendant non-XML components (graphics, etc.) (a "compound document" in my personal terminology) then you need a way to publish those documents, either directly out of the repository or following some sort of export operation, such that all the links are resolved and the published result will work in its published location (e.g., on the corporate Web server, on a CD, as online help, inside a mobile phone, whatever). This means that the repository has to provide whatever services and functions are necessary to give the publishing system what it needs. In the simplest case this means just providing access to the files because the publishing process can do the rest itself. This is my situation: I have XSLT scripts that can do all the re-use resolution and link processing necessary to produce HTML or PDF from the documents I author in XML (which is all the documents I author that I don't author in HTML or PowerPoint). In the more sophisticated case this is some sort of "rendition server" that publishes documents straight out of the repository, uses the repository's API for link resolution, does optimization of time-consuming things like accessing graphics, and knows how to write out all the publication components so they all work correctly.
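As a rough illustration of the "simplest case", here is a sketch of resolving link-based re-use at publishing time. Everything specific in it is an assumption made for the example: the `<reuse href="file#id"/>` element is invented markup, not any particular standard, and the `REPOSITORY` dictionary merely stands in for "access to the files". My own processing is done in XSLT; Python is used here only to keep the sketch short.

```python
import copy
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the repository: file name -> XML source.
REPOSITORY = {
    "warnings.xml": '<topics><note id="hot">Surface may be hot.</note></topics>',
    "manual.xml": '<manual><title>Toaster</title><reuse href="warnings.xml#hot"/></manual>',
}


def find_by_id(root, target_id):
    """Return the first element (root included) whose id attribute matches."""
    for elem in root.iter():
        if elem.get("id") == target_id:
            return elem
    return None


def resolve_reuse(name, repository, _seen=()):
    """Parse `name` and replace each <reuse> element with a copy of its target."""
    if name in _seen:
        raise ValueError(f"circular re-use involving {name}")
    root = ET.fromstring(repository[name])
    for parent in list(root.iter()):
        for index, child in enumerate(list(parent)):
            if child.tag != "reuse":
                continue
            target_doc, _, target_id = child.get("href").partition("#")
            # Resolve the target document first so nested re-use also works.
            target_root = resolve_reuse(target_doc, repository, _seen + (name,))
            target = find_by_id(target_root, target_id)
            if target is None:
                raise LookupError(f"unresolved link {child.get('href')} in {name}")
            replacement = copy.deepcopy(target)
            replacement.tail = child.tail  # keep any text that followed the link
            parent[index] = replacement
    return root


published = ET.tostring(resolve_reuse("manual.xml", REPOSITORY), encoding="unicode")
```

Note that an unresolvable link is an error rather than something silently skipped: the rendition process doubles as link validation, which is what lets this work without any link management in the repository itself as long as the scale is small.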

- Full text search.

This is an important requirement--who doesn't want to know what they've got and be able to find it quickly? My personal opinion, based on both my overall architectural model and my implementation experience, is that full-text search is best implemented as a component separate from the core storage and link management features, but operating against the same storage objects. That is, if you store your XML as storage objects (and not as decomposed elements in some object database), then you can easily apply any modern XML-aware (or unaware) full-text indexing system against it. It then just requires a little bit of integration work to resolve the search results returned by such a system to the specific documents and elements located.

There are a number of good XML-aware full-text indexing systems out there and there are some remarkably good low-cost solutions, including the Lucene search engine (part of the Jakarta project) using the technique outlined by Brandon Jockman, Joshua Reynolds, and myself for using Lucene to index arbitrary XML documents (the paper is here: ). Doing a Google search just now on "lucene and xml" revealed that there's been a good bit of activity in this area since we wrote our paper, which is cool. I'll have to look into that more when I get a chance.

The main challenge with indexing XML is knowing what to index and how to index it. The simplest approach is to index every element and attribute, but for a sophisticated schema, it might be useful to index things more in terms of the business objects the XML reflects rather than the raw XML. There's a lot to say there but I'll just leave it at that for now.

This approach to indexing is also just good separation of concerns. As Tim Bray says in this blog post, there are lots of useful ways to index XML. It only makes sense then that you design your system so that you can either change the index as you need to or use multiple indexes and indexing technologies over the same body of data.
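For a sense of what "index every element and attribute" amounts to, here is a toy inverted index in Python. It is not the Lucene technique from the paper and not how a real engine works internally; the function name, the path notation, and the sample document are all assumptions made for illustration.

```python
import re
from collections import defaultdict
import xml.etree.ElementTree as ET


def index_document(doc_id, xml_text, index):
    """Add each element's own text and attribute values to an inverted index.

    Postings are (doc_id, path) pairs so that a hit can be resolved back to
    the document and the element or attribute that contained the term.
    Paths here carry no positional information, so sibling elements with
    the same name share a path.
    """
    root = ET.fromstring(xml_text)

    def visit(elem, path):
        here = f"{path}/{elem.tag}"
        # Text directly inside this element: its leading text plus the
        # text that follows each of its children.
        text = " ".join([elem.text or ""] + [child.tail or "" for child in elem])
        for term in re.findall(r"\w+", text.lower()):
            index[term].add((doc_id, here))
        for name, value in elem.attrib.items():
            for term in re.findall(r"\w+", value.lower()):
                index[term].add((doc_id, f"{here}/@{name}"))
        for child in elem:
            visit(child, here)

    visit(root, "")


index = defaultdict(set)
index_document(
    "manual.xml",
    '<manual><title>Toaster</title><note type="warning">Surface may be hot.</note></manual>',
    index,
)
hits = index["hot"]  # the set containing ("manual.xml", "/manual/note")
```

Indexing in terms of business objects would mean replacing the raw paths with something the schema designer chose, for example mapping several element types to a single "warning" field.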

- Workflow

Workflow may or may not be an important feature. As I've said, I haven't seen a need for more than just tracking documents or elements against authors and workflow stages, but in any case, workflow requirements will always be very specific to a given business or enterprise so it will always be the case that implementing workflow requires a lot of custom work, even if you're using an off-the-shelf workflow engine. But in my experience, time and budget spent on workflow is time and budget better spent elsewhere in the system.

- Link management

This is where the fun is and I'll talk about it in great detail going forward. The main thing with link management is that it's very easy to go horribly horribly wrong here. It's also very easy to over-engineer and under-engineer. I have sufficient hubris to think I know how to do it right.

- Application of taxonomies to documents

This is really an extension and refinement of indexing but is usually approached as a separate activity. I think there's some potential value here but I think that for the vast majority of enterprises using XML for authoring technical documents, it will take a long time to get to this level of sophistication and the ROI of providing these features is going to be hard to quantify because it can at best provide incremental improvement to what will already be possible with basic search and retrieval and just having a working knowledge of the information domain. I think the lesson of Google here is very instructive: you can get a lot of mileage out of plain text searches if your initial index is clever and your ranking algorithm is appropriate. This is not to say that classification is not useful (it absolutely is) or that taxonomies that organize the individual classifications are not useful (they absolutely are). But given a defined taxonomy the minimum you need to use it is a way to classify individual documents or elements, and a generic name/value metadata mechanism gives you that. The hard part is defining and maintaining the taxonomy itself and that's a human problem, not a technology problem. My personal opinion is that application of taxonomies is an area that tends to be overvalued and overengineered, or at least implemented earlier than it should be at the expense of some more immediately useful feature. Of course in environments where taxonomic classification is more essential to the information, such as encyclopedias, that equation might change. Or it might not.
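To show how little machinery the "generic name/value metadata mechanism" needs, here is a small Python sketch. The class name, the `subject` property, the object identifiers, and the three-term taxonomy are all hypothetical; a real repository would persist this rather than hold it in memory.

```python
from collections import defaultdict


class MetadataStore:
    """Generic name/value metadata attached to documents or elements by id."""

    def __init__(self):
        self._objects_by_pair = defaultdict(set)  # (name, value) -> object ids

    def classify(self, object_id, name, value):
        self._objects_by_pair[(name, value)].add(object_id)

    def find(self, name, value):
        return set(self._objects_by_pair.get((name, value), ()))


def narrower_terms(term, taxonomy):
    """Return `term` plus every term below it; taxonomy maps child -> parent."""
    found = {term}
    changed = True
    while changed:
        changed = False
        for child, parent in taxonomy.items():
            if parent in found and child not in found:
                found.add(child)
                changed = True
    return found


def find_under(store, name, term, taxonomy):
    """Find objects classified with `term` or with any narrower term."""
    results = set()
    for narrower in narrower_terms(term, taxonomy):
        results |= store.find(name, narrower)
    return results


# Hypothetical taxonomy and classifications.
TAXONOMY = {"appliances": None, "toasters": "appliances", "kettles": "appliances"}
store = MetadataStore()
store.classify("toaster-manual.xml", "subject", "toasters")
store.classify("kitchen-guide.xml#sec2", "subject", "kettles")
matches = find_under(store, "subject", "appliances", TAXONOMY)
```

The code is the easy part. Deciding that "toasters" belongs under "appliances", and keeping that true as the product line changes, is the human problem.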

- Large-volume publishing

This is mostly, I think, an exercise in scale and performance that doesn't require any particular XML-specific approaches, just good engineering. The technique and technologies are well understood. The main issue, I think, is usually whether or not to process the source documents in place or copy them first. For one system with which I'm familiar (but did not contribute to the publication parts of), they had to go behind the back of the repository and access the documents directly from disk--they couldn't even afford the time it would take to export the documents (stored as storage objects) from the repository. That sort of thing.

- Management of localized document versions.

This is a challenge and I can't claim to have any definitive answers here, but I think that trying to manage it at a very low level of granularity is not the best solution. In the systems I've been directly involved with it was managed at about the subsection level but that may have been as much a side effect of the quick-and-dirty management approach we were using as it was an appropriate solution. The management of localization in general is compounded by the fact that most localization support tools (tools that support translators in doing the translation of the files) are pretty weak in terms of their XML support and therefore impose some constraints on how the data can be provided to them. In short, as far as I can tell, it's something of a mess. While I do claim expertise in the composition of localized documents I am certainly not an expert on localization itself.

My instinct is that this is fundamentally a linking problem in the way John Cowan alluded to for versioning in one of his comments to an earlier post. But I haven't had the opportunity to prove or disprove this theory in practice.

There are some commercial XML-aware CMSes that claim to be optimized in their support of the management of localized documents. I can't comment on these tools one way or another because I haven't directly worked with any of them. I suspect that most, if not all, of these products violate my architectural principles. The question is: is that a necessary decision in the face of the requirements inherent in localization management? I don't know.

- Long-term archival storage.

This one seems easy to me: write the stuff to optical disk or tape using existing archival technology. Of course this presumes you're storing XML as XML and not decomposing it to objects. I think the challenge is also archiving the linking and metadata information. Here's where having an XML representation of all of that can come in very handy. If you use XSLT scripts to at least express and document the business logic for processing the XML representation of metadata and link indexes, you can archive that too and at least have some hope of recovering its utility 40 or 50 years from now. If all your indexing and metadata storage is bound up in proprietary object data models you're going to have a bit of work, if you can get it at all (worst case you're forced to re-implement what the CMS already does just in order to create an archivable version of it, which calls into question the existence of the original CMS in the first place).

Which brings me back to one of my design and implementation principles: design the whole thing as a system of XML documents processed using XSLT (or DOM-based programs if the business logic is too complicated for XSLT to be a productive implementation language). This ensures several things:

- Your data designs are clear because they are transparent, being expressed in easily-accessible XML forms (XML itself, schemas, etc.)

- You have a built-in, ready-made XML-based interchange/archival representation of your system from the get go.

- You can validate and refine the data structures and business logic with a minimum of effort and expense before doing any necessary optimizations

XML standards exist for everything you need to represent links, capture metadata, and express dependency relationships. The necessary processing infrastructure is there. It can all be built very quickly in a low-scale, low-performance way. And you just might find that what you thought wouldn't scale does and what wouldn't perform does. That has been my experience.

So to sum up:

- If what you need is just versioning, access control, and distributed access, normal versioning systems, possibly with a little bit of additional customization, will be just fine and impose a minimal cost of entry.

- Of all the higher-level requirements, re-use is the one I focus on first because it's what most people really need most and, as a special case of linking, implementing good support for re-use gives you most or all of what you need to support linking more generally. Oh, and using external parsed entities is not re-use.

- Workflow is, in my opinion, overvalued and/or overcomplicated. Simple solutions for workflow management are often more than enough. If you think you have a strong workflow requirement, push on it really hard before you invest a lot in satisfying that requirement.

- Management of localized documents and the localization (translation) process is an inherently hard problem for which there are currently no obvious best solutions. The problem is compounded by the weak XML support in the current translation support tools.

- High-volume publishing and archiving are important features but do not present any particular XML-specific challenges, meaning they can be addressed using standard engineering approaches for scaling and archiving. For archiving, the only caveat is "do you have a standards-based way to represent the entire state of the repository, or at least the state you need to archive at any given moment?" The answer needs to be "yes".

- Taxonomic classification can be done using generic storage object and element-level metadata. It's usually not a phase one or phase two requirement. It's often overvalued and overengineered.

- Assuming you're creating documents so you can publish them you'll need some form of dedicated publishing support component that provides the data and metadata in the repository to the publishing processor so it can at a minimum resolve re-use and navigation links, access document components, and so on. More sophisticated "rendition servers" may need to do things like job queuing, load balancing, optimization of access to graphics, and so on. Depending on your publishing scale and performance requirements, you may need to have the rendition processes occur on dedicated machines separate from the repository servers, for example.

If you are thinking about your XML content management needs and in the process of evaluating different tools or implementation approaches I urge you to think about your requirements very carefully and very critically. As you can see I am focusing very heavily on what is the simplest thing you can do that will satisfy your requirements. Full-featured CMSes are expensive no matter how you slice them. Whether you build your own from scratch (my personal preference) or buy off the shelf and customize it, implementing every feature that could or would be useful will cost a lot of time and money. It's very likely that if you are representing a typical technical documentation group, even one within a large, successful enterprise, you don't have a lot of budget, or the executive champion you have today who is all gung ho will not be there in 18 months or two years when the budgets get re-evaluated and people start asking hard questions about the system you're building.

So there's a lot of value to starting small and working your way up. But doing that successfully requires clear thinking about your requirements and being really honest about what's a must have and what's a nice to have.

My experience is that once the authoring and management community starts to get what the potentials are for using XML in clever ways they start to think of all sorts of things that would be useful but that would raise the cost of a system very high, especially if they are implemented too early, before the core facilities needed to really support those features are in place. Once the core features for metadata management and versioning and re-use are in place, then you will find, if you've architected implementation flexibility into the system, that doing these more sophisticated things becomes drastically easier. But if you try to do them too early you will only know pain.

Next up: Whatever my fevered brain decides is most important at 5:30 in the a.m.



John Cowan said...

Life's a funny old (female) dog.

I once was interfacing to a workflow system, picking up the output (fully ready to go documents) and delivering them.

The original version of the workflow system was about as simple as it could be. It had the editorial people move a file around from one directory to another: there were directories for draft, written, copy-edited, edited, final, and released states. (The difference between final and released had to do with respecting embargos, and was the only automatic step.)

Then a replacement system was written in Access and VBA, using just two directories and keeping the current state in a database. (The second directory was the released directory, so that my code didn't have to read the database.) This was easier to use and more reliable, but subject to its own annoyances.

If I were redesigning the workflow system now, I'd take advantage of one of the fundamental differences between CVS and Subversion, namely that CVS versions individual files, whereas Subversion provides version control over a whole directory tree, which you can organize however you like. Conventionally it's used to represent branches and tags, but Subversion itself (as opposed to the scripts that convert from CVS) doesn't know anything about that.

Instead, the directory tree could be used to represent the workflow state. The same directory scheme used by the original system could reappear, and people could change the state by direct manipulation of the file, dragging it into the (versioned) directory where it belongs.

"First there is a mountain / then there is no mountain / then there is." --Donovan

12:07 AM  
