XML Content Management the Dr. Macro Way: Simple Is Good
Let me start with the story of what is both one of my greatest triumphs and one of my biggest failures as an integrator of XML content management systems: Woodward Governor.
Woodward Governor is a large corporation that makes governors of all kinds (they started making governors for steam engines and went from there). We (ISOGEN) were contracted by Woodward's Jet Engine Fuel Control division to help them build a system to manage the assembly, repair, and test manuals for their fuel controls. Fuel controls are the parts of jet engines that control the flow of fuel into the engine. They are of course safety critical components that are carefully engineered. They are complex parts that must be assembled and tested with the greatest of care.
The assembly and test operation is driven by detailed procedural manuals that outline every step of every process in specific detail, including references to specific part numbers.
This presents several challenges to the technical writers, mostly related to the management of part numbers. The manuals must accurately reflect specific engineering versions of a given model of fuel control. For example, an engineering change might change a given o-ring from one part number to a different one with different mechanical and chemical characteristics in order to address a design flaw that could lead to failure in operation. Obviously you want to make sure that the technician operating on that specific model of fuel control sees the correct part number in their copy of the manual.
The approach that Woodward wanted to take (and the approach I would have recommended) was to do two things:
1. Enable the management of model-specific versions of manuals that reflect specific variants of each model. Ideally you would have versions that reflect specific serial numbers, but at least at the time, that was impractical.
2. Ensure that each part's definition (including part number) occurs exactly once in the source data and is used by reference everywhere it is mentioned.
In addition, the manuals themselves were physically organized into many component documents (i.e., one component per task or subtask--I don't remember the details now), combined together using what, at the time, was HyTime's value reference feature but would today be a specialization of XInclude, to create compound documents.
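The same compound-document assembly can be sketched today with XInclude support from the Python standard library. This is a minimal illustration, not the original system: the component file names, element names, and content are all invented, and the components are held in an in-memory dictionary rather than a repository so the sketch is self-contained.

```python
# Sketch: resolving an XInclude compound document with Python's standard
# library. Names and content are hypothetical stand-ins for the real
# task/subtask component documents.
import xml.etree.ElementTree as ET
from xml.etree import ElementInclude

# Stand-ins for component documents that would live in the repository.
COMPONENTS = {
    "task-disassemble.xml": "<task><title>Disassemble fuel control</title></task>",
    "task-test.xml": "<task><title>Test fuel control</title></task>",
}

def loader(href, parse, encoding=None):
    # Resolve hrefs against the in-memory store instead of the filesystem.
    if parse == "xml":
        return ET.fromstring(COMPONENTS[href])
    return COMPONENTS[href]

compound = ET.fromstring(
    '<manual xmlns:xi="http://www.w3.org/2001/XInclude">'
    '<xi:include href="task-disassemble.xml"/>'
    '<xi:include href="task-test.xml"/>'
    '</manual>'
)
ElementInclude.include(compound, loader=loader)
print([t.findtext("title") for t in compound.findall("task")])
```

In a real system the loader would fetch a specific version of each component from the version store, which is exactly the point where compound documents and versioning intersect.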
So here we have the classic XML content management challenge of managing systems of versioned hyperdocuments with lots of version-specific links. The total volume of information was large but not staggering: on the order of 100s of separate document numbers (the unit of publication) and 10s of thousands of individual files, with hundreds of thousands of link instances.
Unfortunately (at least for us as integrators trying to make a buck), Woodward only had enough budget to pay for either analysis and design or tools and integration, but not both, so they (correctly) went for analysis and design. We still had to produce a minimally working prototype system but as there was no budget we had to do it as much as possible with free tools. At the same time I was instructed to "don't make it scale or perform--we want them to have to get more budget to build the production version." Fair enough--they weren't paying for a full system and we were under no obligation to provide one.
The approach I went with was as follows:
1. Use CVS (Concurrent Versioning System) for core file management and versioning. Today I would probably use Subversion but the concept is the same. CVS provides perfectly good versioning of text files. It is stable, reliable, and well supported. It is easy enough to integrate with. It allows you to access specific versions of files. It's free open source. It is easy to set up. All of these characteristics now apply to Subversion, with the added benefit of full support for Unicode data and binary diffing, as well as more complete versioning semantics (such as atomic commits of multiple files, versioning of directories, and so on, all of which I consider essential for a fully complete versioning solution).
2. Use a simple relational database to hold the part-number metadata needed to support authors in accurately creating links to part number definitions (remember our single-source instance requirement).
3. Write custom code to do the following:
- Integrate the authoring tool with CVS for check-in/check-out
- Extract link index information on check-in (via a simple CVS "on commit" hook) and load it into the database
- Integrate the authoring tool with the link index to enable easy creation of links
- Process the compound documents in order to resolve all links for publication and/or dynamic online delivery
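The indexing step in the list above can be sketched as a small extraction pass over a checked-in document. This is illustrative only: the element names (`partdef`, `partref`), attribute names, and the sample document are invented, since I don't know the actual Woodward schema.

```python
# Sketch of the check-in indexing hook: scan a document for part-number
# definitions and emit the rows to be loaded into the link index.
# Element and attribute names here are hypothetical.
import xml.etree.ElementTree as ET

SAMPLE = """
<procedure id="assy-001">
  <partdef id="p1" partno="123-456"><name>O-ring, Viton</name></partdef>
  <step>Install <partref ref="p1"/> in the main housing.</step>
</procedure>
"""

def extract_index(xml_text, docname, version):
    """Return (part number, part name, doc, version, element id) rows."""
    root = ET.fromstring(xml_text)
    rows = []
    for pd in root.iter("partdef"):
        rows.append((pd.get("partno"), pd.findtext("name"),
                     docname, version, pd.get("id")))
    return rows

print(extract_index(SAMPLE, "assy-001.xml", "1.4"))
```

The commit hook simply runs this over each committed file and loads the resulting rows into the database, keyed by the file's new version.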
This is all workaday stuff and none of it was very hard. We used the now-defunct GroveMinder tool for all the HyTime-related processing (link resolution) but the rest was pretty standard integration stuff. We used Python as the implementation language for the server-side stuff because it's an easy language to use, did what we wanted, and could be taken over by the client.
The link index information was pretty simple, just relating part numbers to the XML elements that defined those part numbers in a specific version. I'm pretty sure we also captured the part name so we could put that in a GUI for link creation. When I started I was just going to store the link index in a flat file (remember the "can't perform or scale" requirement) but I found it was going to be tedious to implement the reading and writing of the file so I looked around for the simplest database I could find. What I found was Gadfly, a SQL database implemented entirely in Python. It had no persistence, holding its tables entirely in memory. "Perfect", I thought, "this can't possibly scale or perform".
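Gadfly itself is long defunct, but Python's built-in sqlite3 in `:memory:` mode reproduces the same "in-memory SQL, no persistence" trick. The table and column names below are illustrative, not the original schema.

```python
# Sketch of the link index as an in-memory SQL table, using sqlite3
# as a modern stand-in for Gadfly. Schema and data are hypothetical.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE partindex (
    partno TEXT, partname TEXT, doc TEXT, version TEXT, elem_id TEXT)""")

# Rows like these would come from the check-in indexing hook.
db.executemany("INSERT INTO partindex VALUES (?, ?, ?, ?, ?)", [
    ("123-456", "O-ring, Viton", "assy-001.xml", "1.4", "p1"),
    ("123-457", "O-ring, Buna-N", "assy-001.xml", "1.3", "p1"),
])

# The authoring GUI would run a query like this to offer link targets
# for a specific part number in a specific version.
row = db.execute(
    "SELECT doc, elem_id FROM partindex WHERE partno = ? AND version = ?",
    ("123-456", "1.4")).fetchone()
print(row)
```

Because the tables are purely an index over the XML source, losing them costs nothing: re-run the indexing pass over the repository and they are rebuilt.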
Boy was I wrong.
Because it is so light weight, Gadfly is very fast. Because the scale of the system was relatively small there was no issue with the tables in memory being too big (and memory costs dropped much faster than the system's scale increased--the number of different fuel control models is relatively stable and increases relatively slowly).
Because this was just an index of the data in the XML, persistence wasn't an issue either: as long as the server wasn't turned off, the tables were there. If the tables went away, all you had to do was re-index all the docs and you were back in business.
So the damn thing performed and met Woodward's scale requirements.
This was in the 1996/97 time frame. Last fall I happened to talk to the guy at Woodward who had originally contracted us to build this system and who is still there, and he said that they were, 10 years later, starting to think about replacing Gadfly with a beefier database, maybe MySQL or something. Otherwise the system was essentially unchanged.
It had met all their operational requirements.
I had, at the same time, succeeded amazingly well (who out there has built a system that has operated largely unchanged for 10 years while requiring minimal maintenance or change?) and completely failed at my instruction to produce a system that would neither perform nor scale.
This is also the simplest system I have ever built.
There is a lesson there.
The lesson is: managing XML content is not actually that difficult and remarkably simple systems can work very well in most circumstances.
I have since repeated this experience a couple more times.
The key lessons I took away from this experience and that drive all my thinking about content management are:
1. Manage the XML source as versioned storage objects
2. Do all semantic processing, including link management, metadata indexing, etc., as separate activities on top of or separate from the core storage
3. Almost all of the inherent complexity in XML content management is concentrated at the boundary between the repository and the outside world, and that is where the system's implementation complexity should likewise be concentrated.
If you examine any XML content management system critically in terms of these principles you can quickly identify potential problems.
For example, any system that stores the XML by decomposing documents into individual elements as objects has violated the first two principles. They have, in effect, conflated the data storage (storing the base XML character stream) with the indexing and semantic processing. Let us call these systems "element-decomposing systems".
It's easy to understand how an engineer might think this is a good idea: "Hey, XML elements are essentially objects. We need to be able to find and retrieve them quickly. Why not make them objects in an object database?"
Unfortunately, it just doesn't work well in practice. A little thought should reveal why, but I want to focus on the core principle of separation of storage from indexing, because from that all else follows.
If you think about the semantic processing of XML, especially in the context of document authoring, that processing can be broken down into a few basic operations:
- Finding content based on markup alone
- Finding content based on text content alone
- Resolving link relationships among elements
For example, the basic use case of finding the right XML chunk in a large body of content in order to re-use it requires being able to search across the body for particular content in a particular context and then either literally copy it or create a hyperlink to it. Having created that hyperlink, processing the result requires resolving the hyperlink to retrieve the linked content and then process that.
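The two halves of that use case--finding content by markup and resolving a link to it--can be sketched in a few lines. This is a minimal illustration with invented element names, not any particular system's API.

```python
# Sketch of the two basic operations: find content based on markup
# alone, then resolve a reference to it. Names are hypothetical.
import xml.etree.ElementTree as ET

doc = ET.fromstring("""
<manual>
  <partdef id="p1" partno="123-456"><name>O-ring</name></partdef>
  <step>Install <partref ref="p1"/>.</step>
</manual>""")

# Indexing: find content based on markup alone (all partdef elements).
by_id = {pd.get("id"): pd for pd in doc.iter("partdef")}

# Relationship management: resolve a partref to its definition.
ref = doc.find(".//partref")
target = by_id[ref.get("ref")]
print(target.get("partno"))
```

The point is that neither operation requires the storage layer to know anything about elements; both are computed over the XML stream and can be re-implemented or re-indexed at will.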
I like to break these down into two basic operations: indexing and relationship management, where the implementation of relationship management may take advantage of indexing.
Thus I will use the term "indexing" to mean any sort of retrieval applied to the XML content based on any characteristics of the XML data stream using any method.
The term "relationship management" means any processing involved in capturing and using knowledge of referential relationships expressed by the XML content in whatever form.
If you think about how you might go about doing indexing and relationship management you should quickly realize that there are many technological approaches one could take and that different approaches will be appropriate for different use cases. For example, the specific approach I took for Woodward Governor would not work for an enterprise with a much larger body of documents, or with 100 times as many authors (Woodward had about 10 authors vs the 300 in my Information Development group at IBM back in the day), or with much more demanding performance requirements.
Woodward did not, for example, have much of a requirement for searching across their body of information. Because of the nature of their information and the products themselves, it was always clear where related information that could be re-used was and how to find it. If they had had a general search requirement I would have had to do something more, or approach the system in a completely different way (at the time there was no inexpensive solution for XML search and retrieval; today there is).
Given that there are many different useful and productive ways to do indexing and relationship management, and given that the service life of most technical documentation is measured in years if not decades, it follows that in building a system to manage those documents you would want to build in as much flexibility as you can, so as to minimize the future cost of reconfiguring or reworking the system in response to new requirements or new technologies. Not to mention protecting yourself from the vagaries of business life cycles--will the company you buy from today be around in 20 years to support their software?
This is simply sound engineering. One way to express it, more or less, is as avoiding premature optimization.
This is a fundamental principle of good software engineering. It says essentially: don't optimize your software any sooner than you absolutely have to. The reason for this is that optimization almost always compromises the clarity and cleanness of the system in terms of its organization into distinct modules connected through APIs or layers of abstraction that separate the core data models from details of implementation.
The Woodward system is an example of building a "pure" system that did no unnecessary optimization at all, yet the system performed well. At the same time, because the system is composed of well-distinguished components connected through reasonable layers of abstraction, it would be easy for Woodward to extend or refine their system as necessary (for example, to replace Gadfly with MySQL or CVS with Subversion or even something like Documentum if they felt so inclined).
This can also be expressed as "separation of concerns" whereby the distinct domains of processing (business logic) are implemented by distinct and separable system components.
Now apply these principles to the typical element-decomposing system. By making each element an object they have optimized retrieval of those objects but at the cost of separation of concerns because there is no clear separation between the storage of the XML data stream and the storage of the index over that stream (because the objects and their attendant metadata are the index).
So the first obvious problem is no clear separation of concerns: is there a way to reconfigure the system so that the XML data storage and the index are separate? Can you replace one without replacing the other? If the answer is no to either question then there is no good separation of concerns.
The next question is whether or not this constitutes premature or inappropriate optimization. You answer that question by determining if you could get comparable performance using another approach that had otherwise better characteristics (such as better separation of concerns, lower upfront or ongoing costs, smaller system footprint, etc.).
I would suggest, based on my experience, that the answer is invariably "yes, this is premature optimization". My confidence stems from the fact that you can get the indexing and retrieval features you need relatively easily without doing decomposition. I know because I've done it and others have done it. In addition, the premature optimization of the element-decomposition approach brings along with it a number of unavoidable limitations that other approaches do not, including:
- High sensitivity to schema changes. If the object schema directly reflects the XML schema, then any change to the XML schema requires changes to the object schema. In the worst case (which seems to be the typical case) this means exporting and re-importing all the data with that schema. If the object schema is more generic (i.e., document contains elements, which contain elements or pcdata), then much of the advantage of using objects for indexing is lost, because key information, such as element type names, has to be captured and maintained as secondary metadata, not as primary object types. Since object databases are presumably optimized to retrieve objects by their base type, using a more generic schema will not take advantage of the optimization that an XML-schema-specific object schema would provide.
- High overhead for limited use. One argument for decomposing documents at the element level is to enable re-use of individual elements at any level. The problem is that in practice only a fraction of all the elements will be re-used in this way, yet the system must decompose them all. Objects have very high overhead compared to the raw string representation of the XML data. As a result, element-decomposing systems tend to be slow, at least for some operations, such as import or export, or have to do lots of twisted things to achieve reasonable performance, or require a very large system footprint, or, as is typical, all of these.
- Inflexible units of chunking. In order to address the issue of too many objects, some systems require you to pre-define which elements in which contexts are chunked and which are not. This leads either to limiting chunking, and therefore potentially disallowing finer-grained re-use, or to chunking everything (and thus not avoiding the problem of too many objects).
- Built-in limitations in terms of how things are stored, based on either limitations in the underlying system or arbitrary or short-sighted implementation decisions made at some time in the past. For example, many systems cannot store data in different encodings--it's either all UTF-8 or all UTF-16. Some limit the way that links can be represented syntactically. These limitations tend to be surprises to users and implementors that come up at the least opportune time or require changes to data and/or existing business practices.
At the time that many such systems in use today were first developed, computers were much slower and the general processing infrastructure that we have today with XML did not exist. Engineers either had to, or felt they had to, apply these sorts of optimization approaches just to make a working system. I think they were wrong even then, but that's in hindsight, and I don't fault anyone who thought object decomposition was a productive approach then. I do fault anyone who thinks that today, because it is demonstrable that object decomposition, in today's world, is simply unnecessary and suboptimal as an engineering approach.
To summarize this first set of thoughts:
- Remarkably simple systems can work remarkably well
- Premature optimization is to be avoided with extreme prejudice
- System composition into software components should reflect the long expected service life of the system and documents it will manage. This means clear separation of concerns reflected through clean APIs and layers of abstraction
- Computing system limitations that held 20 years ago no longer hold.
Finally, here is one way that I like to think about designing XML content management systems:
Can I implement all the functionality required using Subversion (nee CVS) and XSLT (possibly with a few extension functions to handle specialized business logic, such as connecting to another, pre-existing information system)?
That is, can I prove my understanding of the requirements and business processes through the implementation of a system using a brute force mechanism?
If the answer is yes, then the next question is, why don't you? If the system is engineered with the appropriate componentization you know that you can subsequently refine the system as needed by reworking and replacing components to make them perform or scale as needed. But you might also find, as Woodward did, that the simplest possible solution is in fact good enough, in which case you've saved a huge pile of time, money, and pain.
If, on the other hand, you go for a monolithic, pre-optimized, uncomponentized system, you are taking a big risk that the system will not satisfy some critical requirement.
The next subject will be a discussion of the boundaries--my third lesson learned above. That is, regardless of how well your system performs the tasks of data storage, versioning, indexing, and relationship management, you still have to get things into it and out of it and that is where most of the complexity lies for a variety of reasons.
If you review my description of what we did to build the Woodward system above, you will see that almost all of our effort went into the edge stuff: supporting import and export. The effort needed to build the core repository and indexing was trivial by comparison. I would hazard that easily 80 or 90 percent of our time billed was on boundary stuff (import, export, authoring tool integration, etc.).
Thus, regardless of what form the core repository takes and how it implements support for data storage, indexing, and relationship management, there will be significant effort required to integrate that system into a given operating environment and adapt it for use with specific XML schemas and XML processes (publishing, retrieval, data extraction, etc.). Thus one way in which XML content management systems are also distinguished is the degree to which they both provide useful boundary functions and enable implementation of customized boundary functions.
Because most of the integration effort will be concentrated on the boundaries, it further supports the engineering conclusion that minimizing the amount of effort spent on the core functionality is good because it maximizes the amount of the total implementation budget that can be spent on implementing the boundary functionality.
In addition, because the boundary functionality is almost without exception schema- and business-process-specific, being able to apply the engineering discipline of separation of concerns here is paramount: the boundary functionality, once implemented, will be more or less invariant regardless of how the underlying repository is implemented. Therefore the boundary functionality implementation should, as much as possible, be distinct and separable from the repository itself. This is usually not the case (at least out of the box) for most monolithic XML content management systems (regardless of whether or not they are element-decomposing systems).
I will explore these issues of boundary complexity in my next posts on this topic.