
Sunday, July 23, 2006

XCMTDMW: Characteristics of an XML CMS, Part 2

In part 1 of this discussion of CMS characteristics I outlined those features that apply to the management of storage objects. Storage object management is a prerequisite for any more sophisticated semantic management. It is also generic in the sense that the storage object manager (the repository) does not, by itself, know or care what is inside the storage objects. It just holds the data, manages version relationships, manages arbitrary metadata for storage objects, and manages other arbitrary dependency relationships among storage objects. We also discovered that the Subversion system does all of these except the last (and it might be made to do that one too--I haven't had the time to look).

This means that for a zero up-front software cost and a modest amount of custom integration effort (a few person weeks to a few person months) you can implement an adequate and quite useable XML-aware content management system. It won't do everything you might like it to do but it will be freaky cheap compared to any commercial system you might buy and it will be relatively easy to replace or refine as needed.

With a system of this sort, any XML awareness would be in the boundary code, that is, the code that takes XML and puts it into the repository, examining the XML data to extract any metadata to be applied to storage objects, to instantiate dependencies, and so on. This is the integration effort, and its scope is entirely a function of the local requirements. In the Woodward case it was minimal because we only cared about part number metadata and nothing else. In your case you might care about a lot of stuff or have much more complex links or whatever. But since it's integration you're driving, you can choose how quickly to build it and how complete it needs to be. You can get a minimally useful system up and going pretty quickly and with a very modest investment. I think this is very important, especially in the context of technical documentation, where budgets tend to be modest anyway and all the other aspects of moving to XML-based documentation are probably sucking up a good chunk of the budget (for example, converting legacy documents into XML, developing your schemas, developing your publishing systems, acquiring and customizing authoring tools, and so on).
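
To give a feel for how small that boundary code can be, here is a rough Python sketch of a part-number-extracting importer, assuming Subversion as the underlying repository and a hypothetical partno attribute in the markup. It is not the Woodward implementation, just the general shape of the thing:

```python
# A minimal sketch of "boundary code": on import, pull part-number metadata
# out of the XML and attach it to the storage object as a Subversion property.
# The attribute name (partno) and the property name (doc:partno) are
# assumptions for illustration, not anything prescribed here.
import subprocess
import xml.etree.ElementTree as ET

def import_document(xml_path):
    """Add an XML document to a Subversion working copy and record its part numbers."""
    root = ET.parse(xml_path).getroot()
    # Collect every partno attribute anywhere in the document (assumed markup).
    part_numbers = sorted({e.get("partno") for e in root.iter() if e.get("partno")})

    subprocess.run(["svn", "add", "--force", xml_path], check=True)
    if part_numbers:
        # Store the extracted metadata as a versioned property on the file.
        subprocess.run(
            ["svn", "propset", "doc:partno", ",".join(part_numbers), xml_path],
            check=True,
        )
    subprocess.run(["svn", "commit", "-m", "Import " + xml_path, xml_path], check=True)
```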

I will also observe that it's easy to imagine lots of cool things that an XML-aware content management system could do for you or could enable you to do with your information that in practice you would never do, for the simple reason that they're too complicated for authors or too fragile at scale or too difficult to quality assure or make the system too expensive or whatever. I certainly see this with fine-grained re-use. Enabling re-use at the content management and processing level is not that hard but doing large-scale re-use across large numbers of publications or across large numbers of products and authors exposes a number of difficult human management and communication issues that have no easy solution (because they depend on people communicating with each other and doing their jobs well).

Just saying.

So what are the semantic processing features that an XML-aware content management system needs to provide? They include:

- Maintain knowledge of element-to-element link relationships (cross references, Xincludes, etc.). This should include the information needed to answer "where-used" questions for individual elements.

- Enable quick location of and access to elements via standard addressing methods (XPointer, XPath, XQuery, etc.)

- Enable quick location of and access to elements based on their text content (full-text search)

- Maintain and manage element-level metadata that is not necessarily inherent in the underlying XML source.

- Provide services (APIs) by which software components can use the foregoing

This set of features could be classified as "XML-aware indexing". The link-awareness component is part of what most people think of as "link management" but is not the whole thing (like storage-object dependency management, it's a necessary but not sufficient prerequisite for full link management).
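
To make the shape of these indexing services a little more concrete, here is a rough Python sketch of the kind of APIs such a layer might expose. Every class name and signature here is an illustrative assumption, not a description of any existing system:

```python
# Sketch of "XML-aware indexing" services exposed as APIs.
from abc import ABC, abstractmethod

class ElementAddress:
    """A document identifier plus an address (e.g., an XPath) within that document."""
    def __init__(self, doc_id: str, xpath: str):
        self.doc_id = doc_id
        self.xpath = xpath

class LinkIndex(ABC):
    """Knowledge of element-to-element link relationships."""
    @abstractmethod
    def links_from(self, doc_id: str) -> list[ElementAddress]: ...
    @abstractmethod
    def where_used(self, target: ElementAddress) -> list[ElementAddress]: ...

class ElementLocator(ABC):
    """Location of elements via standard addressing (XPath, XPointer, ...)."""
    @abstractmethod
    def resolve(self, address: ElementAddress): ...  # returns the element node

class FullTextIndex(ABC):
    """Location of elements based on their text content."""
    @abstractmethod
    def search(self, text: str) -> list[ElementAddress]: ...

class ElementMetadata(ABC):
    """Element-level metadata not necessarily inherent in the XML source."""
    @abstractmethod
    def set(self, element: ElementAddress, name: str, value: str) -> None: ...
    @abstractmethod
    def find(self, name: str, value: str) -> list[ElementAddress]: ...
```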

Each of these features can be implemented in different ways with different levels of completeness. Everything except element-level metadata management could be done through brute force with no persistence at all. Of course this would not scale, but it's very cheap to implement (just hack up the XSLT script you need).

The location of elements using XPath or XQuery can be implemented using a more or less brute-force technique of simply locating the documents you want to examine and then applying the XPath or query to them directly with no pre-indexing or anything. [Remember that our storage-object-to-storage-object dependencies let us quickly determine which documents to search given a starting document.]

For example, say you want to resolve a link from element A in DocA to element B in DocB. To do this you will have to take the address uttered by the link (let us assume it is an XPointer) and resolve it to the element itself. Because every element exists in exactly one document, you first resolve the storage-object-to-storage-object dependency implicit in the link (that is, the LinksTo dependency from DocA to DocB). That gives you DocB. You then apply the XPointer to DocB to get to the element.
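
Here is a minimal brute-force resolver sketch in Python. For simplicity it assumes a bare-ID shorthand pointer rather than a full XPointer, and the dependency-index call that maps DocA's reference onto DocB's actual storage object is hypothetical:

```python
# A brute-force resolver: no element index at all, just parse the target
# document and find the element on demand.
import xml.etree.ElementTree as ET

def resolve_link(dependency_index, source_doc, pointer):
    """Resolve a link like 'DocB.xml#b-123' to an element in the target document."""
    target_doc, _, element_id = pointer.partition("#")
    # Layer 1 dependency data (LinksTo) maps the reference to an actual
    # storage object path. This resolve() call is a hypothetical API.
    target_path = dependency_index.resolve(source_doc, target_doc)
    root = ET.parse(target_path).getroot()
    # Shorthand pointer: find the element whose id attribute matches.
    match = root.find(".//*[@id='%s']" % element_id)
    if match is None and root.get("id") == element_id:
        match = root
    return match
```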

In practice, this scenario works well regardless of repository scope as long as the individual documents are not too large (say, tens of megabytes). That's because it just doesn't take that much processing time to parse a document of typical size and find a particular element. The only real scale issue would be posed by large numbers of users all doing resolution at the same time. In that case it would probably make sense to have some sort of cache or index, but that's an optimization that can be applied later. The core functionality can be implemented quickly and easily using a little XSLT or DOM programming. In most technical documentation environments individual XML documents tend to be quite small, usually less than 100K bytes, because the information is usually chunked up for ease of authoring or re-use. Any technical documentation environment where individual XML documents are larger than a megabyte of XML is not doing something right.

For element-level metadata you do need some sort of persistent store or index, but it doesn't have to be complicated. The minimum you need is a four-column table: attribute name, attribute value, containing document, location of element. Totally brain dead, but it would work. Of course you'd immediately realize that you should probably have one table per attribute, and if any attributes represent pointers to other attributes you'll need to capture those relations, but that's all workaday stuff for anyone with a little RDBMS experience and again represents optimizations that aren't strictly necessary just to make the thing work.
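
The brain-dead version really is this small. Here is a sketch using SQLite; the table and column names are made up for illustration:

```python
# The "totally brain dead" four-column element metadata table, literally.
import sqlite3

conn = sqlite3.connect("element_metadata.db")
conn.execute(
    """CREATE TABLE IF NOT EXISTS element_metadata (
           attr_name   TEXT,
           attr_value  TEXT,
           document    TEXT,
           element_loc TEXT   -- e.g., an XPath or ID within the document
       )"""
)

def record(attr_name, attr_value, document, element_loc):
    """Index one attribute occurrence (called from the import boundary code)."""
    conn.execute("INSERT INTO element_metadata VALUES (?, ?, ?, ?)",
                 (attr_name, attr_value, document, element_loc))
    conn.commit()

def find(attr_name, attr_value):
    """Answer 'which elements carry this attribute value?' across the repository."""
    return conn.execute(
        "SELECT document, element_loc FROM element_metadata "
        "WHERE attr_name = ? AND attr_value = ?",
        (attr_name, attr_value)).fetchall()
```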

With these features in place you now have all the core information management features needed to implement arbitrarily sophisticated business logic over your body of information. You can implement workflow (using dependencies and metadata to track status, ownership, and so on). You can implement publishing (using dependencies and link information to feed your composition or HTML generation process). You can implement author support features like general search and retrieval over the repository, integration with authoring systems, and convenience UIs for creating links or other complex structures; you can build more specialized indexes to satisfy different business processes or authoring tasks; and so on.
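
As one illustration, a workflow rule can be nothing more than a little code sitting on top of the generic element metadata service. The stages, the transition rules, and the metadata-store API used here are all assumptions made up for the sake of the example:

```python
# A sketch of a workflow check built purely on the generic metadata facility.
# The stage names and allowed transitions model an imaginary business process.
ALLOWED_TRANSITIONS = {
    "draft": {"review"},
    "review": {"draft", "approved"},
    "approved": {"published"},
}

def advance(metadata_store, element, new_status):
    """Move an element to a new workflow stage if the business rules allow it."""
    # get()/set() stand in for a hypothetical element metadata API.
    current = metadata_store.get(element, "status", default="draft")
    if new_status not in ALLOWED_TRANSITIONS.get(current, set()):
        raise ValueError("Cannot go from %s to %s" % (current, new_status))
    metadata_store.set(element, "status", new_status)
```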

Let's take a moment to catch our breath and talk about the overall shape of the system as a collection of components.

If you've been able to hold everything I've described so far in your head (and if you have, I'm impressed, because this is not an optimal medium for presenting these complex concepts) then you should be starting to see that the system as I've described it breaks down into three main layers:

Layer 1: Storage object management. This is the layer that manages storage objects as opaque objects. There is no XML awareness at this level in the system: storage objects are just sequences of bytes (e.g., files) or collections of pointers to other storage objects (e.g., directories).

Layer 2: Generic semantic management. This is the layer that provides generic facilities for managing information about the XML as XML (the first level of XML awareness). These facilities are generic because they can be applied to any XML data regardless of schema or application semantics.

Layer 3: Business-Process-Specific semantic management. This is the layer that provides schema-specific or business-process-specific functionality, the second level of XML awareness. This is the layer that binds knowledge of the application-level semantics of a given bit of markup to the generic features needed to make implementation and use of those semantics possible or practical. For example, a simple workflow mechanism that uses element and storage-object metadata to track the progress of a specific XML element through the development process is going to bind the specific development process (the series of stages, the business rules for when you can move from one stage to another, what happens when you do or don't move, etc.) to the generic metadata facilities to both set and examine the relevant metadata values.

When most people start thinking about their need for a content management system they of course focus on the features that will be exposed to users or that reflect the business-process-specific processing of their documents. If you think about these features, whatever they are, in terms of the foregoing layers, it should become clear that all of them can be modeled as Layer-3 components that use the generic facilities in Layer 2 and Layer 1.

It should also be clear that each layer can itself be organized into peer components that can be (and usually should be) separable. For example, in Layer 2 you can have a full-text index component that is separate from the link index component that is separate from the element metadata component, at least as exposed via APIs. Under the covers all these components might be implemented using the same underlying software component or they might not. Diligent engineering suggests that even if they are implemented using the same component there should be a level of abstraction between the API exposed to Layer 3 and the underlying implementation so you can change the implementation decision without having to change your APIs or the implementation of Layer 3.
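
To illustrate what that level of abstraction might look like, here is a sketch of two interchangeable full-text search implementations behind a single interface. The class names and the brute-force-versus-indexed split are mine, purely for illustration:

```python
# Layer 3 code only ever sees the FullTextSearch interface, so either
# implementation can be swapped in without touching the business logic.
from abc import ABC, abstractmethod

class FullTextSearch(ABC):
    @abstractmethod
    def search(self, text: str) -> list[str]:
        """Return identifiers of documents whose content contains `text`."""

class BruteForceSearch(FullTextSearch):
    """Scans every document on every query; fine at small scale."""
    def __init__(self, documents: dict[str, str]):
        self.documents = documents  # doc id -> text content
    def search(self, text: str) -> list[str]:
        return [doc_id for doc_id, body in self.documents.items() if text in body]

class IndexedSearch(FullTextSearch):
    """Builds a word index up front; a stand-in for a real full-text engine."""
    def __init__(self, documents: dict[str, str]):
        self.index: dict[str, set[str]] = {}
        for doc_id, body in documents.items():
            for word in body.split():
                self.index.setdefault(word.lower(), set()).add(doc_id)
    def search(self, text: str) -> list[str]:
        return sorted(self.index.get(text.lower(), set()))
```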

Note also that the code in Layer 3 is, by definition, use-case-specific. That means two things: it's unlikely you will ever find an off-the-shelf system that provides more than part of the Layer 3 functionality you need [although see my comments about DITA at the end of this post] and it's where the bulk of your investment in building the system will go.

This is very important. To the degree that the code that implements Layer 3 is loosely coupled to layers 1 and 2, your investment in that code is protected. To the degree that it is tightly bound, your investment is at risk.

It should also be clear that one implementation choice that cannot in any way reflect this model is the element-decomposition approach, for the simple reason that it combines the storage layer, the element metadata layer, and the other Layer 2 components into a single, monolithic component (the object store containing the objectized elements). While it might provide some generic features for element metadata and so on, as I discussed in an earlier post, if the object database schema directly reflects the XML schema, then it is combining Layer 3 components as well. If the object schema is generic, then it is probably not providing any particular optimization for indexing and retrieval relative to other, simpler, more componentized approaches. Thus, no matter how you slice it, it comes up suboptimal at best, counterproductive at worst.

In most businesses that do technical documentation, the basic business processes and information schemas are pretty stable once established. For example, in any type of product documentation, the nature of that documentation is not likely to change drastically because the nature of the product is not likely to change. For products with very long service lives, such as commercial aircraft, this is obvious (commercial aircraft typically have rated service lives of 50 years or more but are sometimes kept in service longer than that--some DC-3s built in the 1940s are still flying). For products with much shorter service lives, such as mobile phones, it's still the case. While the individual manual might only have a life span of two years, the product line might have an expected service life of 10 years or more, and the business of mobile phones will continue indefinitely until some technology completely replaces them (which seems unlikely to happen in the next 20 or 30 years).

So if you're going to invest a hundred thousand or a million euros in supporting your business processes, you'd like that investment to serve for at least a decade if not much longer. But you can be sure that any implementation technology you use today will be obsolete, unavailable, or unsupported 10 or 20 years from now. So it seems to me only prudent to design your system to minimize the dependency between your highly customized Layer 3 components and the generic services provided by Layers 1 and 2.

If you currently have an XML-aware content management system, you might ask the question: to what degree are my business-process-specific components irrevocably tied to the generic parts of the system? Can I even clearly separate those two?

When Documentum first started rolling out their built-in XML support I was talking to the person who was the primary technical contact for those features. I don't remember the exact reason for the call, but I do remember that I made it very clear that, as an integrator, I would never use Documentum's built-in user interface components for accessing the repository as part of an integrated solution, for the simple reason that the user interface is a key part of the integration investment my client would be making and it needs to be as independent of the underlying repository as possible. The user interface reflects the client's business processes and schemas and it costs a lot to develop, so why should I tie myself to a single underlying system? She didn't seem to get it. Note that I'm not singling out Documentum here: I would give the same answer for any repository: unless it can't be avoided, I won't do it.

Of course, some systems give you no choice by not providing an API by which you can bind your own UIs to the underlying system. As far as I'm concerned, those systems (and you know who you are) are not acceptable candidates regardless of any other features they might have.

You have to ask this question: if a repository vendor doesn't have enough confidence in the compelling value of his system as a set of services to let you use your own UI implementations, why not? If a vendor has to use UI lock-in to keep you as a customer, that should make you think. Of course, the vendor may not have done this intentionally to create lock-in; it might just have seemed like an appropriate engineering choice. In which case I would say that the engineers or architects of the system don't fully understand the requirements that drive repository design, because if they did they would have realized that not exposing a UI connection API is counterproductive.

OK, maybe I'm being a little harsh. Maybe my bitterness as an integrator forced to work with such systems is coming through. But if you're going to charge someone millions of dollars for a system I think you have an obligation to think a bit more deeply about their larger requirements and not just about what's best or easiest for your software.

But that's just me, Bitter Integrator Guy.

The Bonnell system I mentioned before was designed as the antithesis of these types of systems. Bonnell was a collection of services and components separated by clear and public APIs, designed to enable piecewise implementation and modification of a complete content management system. It was a repository framework or toolkit and satisfied our desires as integrators for the kind of system we would like to be able to integrate on top of. As I've mentioned, for reasons beyond our control, the system is currently lost to the general public. We did use it in a project for one client, which went pretty well in that we were able to build the system per the client's specs. Unfortunately that client happened to be a telecom company that was a victim of the crash of 2000 and the collapse of the telecom market. And then DataChannel sold us to Innodata.

It should be no surprise that XIRUSS has the same basic architecture and design and goals and implementation model.

And to be fair in my ranting and bitterness, it's not just repository vendors but repository customers who don't think things through. I don't know how many projects we've had where a company chose a repository because it had a slick demo UI, only to discover that it otherwise couldn't really satisfy their requirements, at least not without a lot of expensive integration work from us. While I don't mind making money doing integration work (that is, after all, my day job), I would rather do five straightforward projects that result in clean, stable, maintainable, reliable systems at a reasonable cost than one painful project that results in a less-than-optimal, hacked, fragile system that never completely satisfies the customer's requirements.

To summarize what we have so far:

- An XML-aware CMS can be modeled as a three-layered system consisting of generic storage object management (Layer 1), generic XML metadata, searching, and retrieval (Layer 2), and business-process- and schema-specific business logic (Layer 3).

- The cost of implementing layers 1 and 2 should be low (ideally free), and implementing them does not entail any particular technical difficulties other than those always imposed by large scale (difficulties that would accrue to any similar information management system).

- Most of your investment in deploying an XML-aware content management system will be in Layer 3 where you are binding your specific business processes and application semantics to the generic facilities in layers 1 and 2. The expected service life of this layer is likely to be decades. You want to protect this investment as best you can.

One thing I should mention at this point: for some enterprises there will be no choice but to pony up non-trivial money to build a system that reflects the sophistication of their information and how it is used, but there will be a return on that investment in terms of faster development, easier repurposing, better information quality, increased writer productivity, and so on. A lot of enterprises, however, simply can't justify that level of investment, yet they still need some basic XML content management features.

Fortunately, emerging standards like DITA (and to a lesser degree, DocBook before it) are starting to lower the cost of entry to sophisticated XML-aware content management, for the simple reason that DITA provides a standard set of sophisticated linking and re-use semantics that can therefore be implemented as generic services. That is, vendors are now motivated to implement DITA-specific Layer-3 components. This means that if you are using DITA out of the box, your initial cost may only be the cost of licensing a given DITA-aware content management system. [I built some generic DITA awareness into XIRUSS-T specifically to demonstrate this point.]

If you are using a specialized DITA-based XML application, then as long as your DITA-aware repository provides the appropriate APIs and whatnot, you should be able to implement the additional Layer-3 features you need at minimal extra cost, because you only need to account for the delta; you don't have to build it from scratch.

Of course DITA is still in its formative stages as are commercial systems that support it, so it's not quite that simple in practice (for example, there are still a lot of interoperability issues that reflect ambiguities in the DITA spec itself or implementor misunderstanding [or willful perversion] of specific DITA features). But this picture should improve dramatically over the next few years.

Next up: Boundary Code Spelunking: Import and Export (really, this time I mean it)
