XCMTDMW: Characteristics of an XML CMS
For this discussion I will use the term "repository" to mean the overall content management system and the body of data it manages, as in "a repository of XML documents". I like "repository" because it's shorter than "XML CMS".
In addition, I feel that the term "XML CMS" is unnecessarily specialized. In my world, content management is a much more general problem and 90% of what you need to manage XML well applies to everything else too. That's another reason I chafe at over-specialized XML repositories--they really can't manage anything else. But what if you happen to have a 20-year legacy of FrameMaker documents and you really don't have a need (or budget or time or patience or stomach) to convert them to XML just so you can get some decent content management features? If things are architected correctly, it shouldn't even really be an issue. But I digress.
There are lots of services and features that a repository could provide, and there is of course a fuzzy boundary between features that are core repository features (such as raw data storage) and features that are really business-process-specific data processing that happens to be implemented at the repository level. That is, for a given enterprise or community of interest that needs a repository there will of course be requirements that are core repository requirements (check-in/check-out) and requirements that are unique to that enterprise or community ("on Tuesdays during the session we produce a full report of all amendments that have gone through third reading. This creates a load spike that the system must handle without significant degradation of other operations during that time"). Obviously, in this sort of general discussion I will be focusing on the former, but you have to keep the latter in mind as you design the general architecture of the system so that it will be able to react to those sorts of unpredictable and idiosyncratic requirements.
Another way to think about it is that a repository is a layered system, with core functionality, the functionality everybody needs, in the core components, with more and more specialized functionality built on top of those components as layers building outward. This is consistent with the separation-of-concerns/modular implementation view as well: each layer represents a more or less clear boundary between concerns that can be implemented as either a literal boundary (via an API) or as a layer of abstraction within the code. Each layer can in turn be organized into discrete components that further separate concerns within a layer. [This is pretty standard systems engineering but it still amazes me the number of commercial systems that either don't reflect this model at all (that is they are internally monolithic) or that may be implemented this way but don't usefully expose the component boundaries to the outside world. For shame.]
So what are the features that a repository needs to provide?
- Reliable, persistent versioned storage of storage objects (e.g., XML documents, FrameMaker files, graphics, HTML pages, program source code, whatever).
By "storage object" I mean "file" or "resource" (in the HTTP sense) or whatever. In XML a syntactic XML document is exactly one storage object (it may make reference to other storage objects, such as external parsed entities).
In XML terms, this means the repository must manage XML documents in the strict syntactic meaning of the term "document" in the XML specification. Note the plural. Many purported XML document management systems in fact manage exactly one document, making them a "document management system". What you usually want is a "documents management system".
By "versioned" I mean "versioned in time" such that for the same logical storage object ("resource" in the terminology I normally use) you can retrieve its data state as it was at any point in time. The simplest useful versioning system only needs to provide a single sequence of versions for a given resource--that is, it does not need to support branching [in my experience, very few people who use versioning systems, even very sophisticated people, actually use branching. It introduces any number of complexities that are best avoided. The only exception to this tends to be in code bases for products where you have to maintain multiple release versions and therefore have no choice but to do branching and merging of code. But in those cases the process is very carefully controlled and managed. In the context of document authoring, with the possible exception of legislative bill drafting systems, I have never seen either a compelling need for branching and merging or a body of authors who could reliably take advantage of it.]
Note that the other meaning of "version"--that is, different alternative versions of the logical content at a single moment in time--is a linking problem (or at least it can be well modeled as a linking problem) and not a core data storage problem. For example, the problems surrounding the management of localized documents, where you have different language versions of the same base text, are either a linking problem or an XML schema design problem, not a core storage problem. Of course some systems that are optimized for managing localized documents do address these through their core storage mechanisms. I feel strongly that that's the wrong way to go about it.
Thus I distinguish "storage object version" from "alternative version". Storage object versioning is purely data storage and is only about versions in time. Alternative versions are semantic concepts orthogonal to storage object versioning.
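To make the "versioned in time" idea concrete, here is a minimal sketch (all names hypothetical, not any particular product's API) of single-sequence versioning: one append-only history per resource, with retrieval of the data state as of any point in time.

```python
import datetime

class VersionedStore:
    """Minimal single-sequence (no branching) versioned storage sketch.

    Each resource has an append-only list of (timestamp, data) pairs;
    retrieval by time returns the newest state at or before that moment.
    """

    def __init__(self):
        self._versions = {}  # resource id -> list of (timestamp, bytes)

    def commit(self, resource_id, data, when=None):
        when = when or datetime.datetime.now(datetime.timezone.utc)
        self._versions.setdefault(resource_id, []).append((when, data))
        return len(self._versions[resource_id])  # 1-based version number

    def get(self, resource_id, as_of=None):
        history = self._versions[resource_id]
        if as_of is None:
            return history[-1][1]  # latest version
        # walk backward to the newest state at or before as_of
        for when, data in reversed(history):
            if when <= as_of:
                return data
        raise KeyError("no version of %s at %s" % (resource_id, as_of))
```

Note that branching never enters into it: a resource's history is just a line, which is all most authoring environments ever need.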
Note that per my previous post, a system like Subversion or CVS satisfies the requirement for reliable, persistent versioned storage of storage objects. Subversion is a little bit better because it provides support for all Unicode encodings and binary diffing, but CVS will work (as I've demonstrated).
- Distributed access to storage object versions. That is, you have to be able to get the data back out, and that access needs to be distributed in the sense that the repository and its clients do not need to be on the same physical machine or running in the same process space. There are many ways to do distribution, but in today's world providing an HTTP-based access method (i.e., a Web server) is a minimum requirement. Other technologies, like CORBA, DCOM, or proprietary internet protocols (e.g., CVS's protocol, which predates HTTP), can also be useful. But if it does HTTP you can pretty much play with anybody.
- Provide appropriate storage object access control such that, at a minimum, users who are not authorized to access a given storage object cannot do so. Some scenarios may further require, for example, that users not even be able to see evidence that a particular storage object exists, much less access its content or metadata. Systems like CVS and Subversion (and most operating system file systems) provide sufficient minimum access control features.
- Provide appropriate locking or conflict resolution facilities such that one user cannot destroy another user's changes. This is usually expressed as either write locking or optimistic locking. Both approaches are useful. CVS and Subversion use optimistic locking (checking for conflicts on commit) but a lot of content management systems use write locking (check out, check in). I prefer optimistic locking as a rule but it's not a religious issue, rather one driven by what will work best for the users and the business process.
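The optimistic approach can be sketched in a few lines (names hypothetical): no locks are taken on checkout; instead, a commit succeeds only if the committer's base version is still the latest, and otherwise there is a conflict to merge and retry--which is exactly the check CVS and Subversion perform on commit.

```python
class ConflictError(Exception):
    """Raised when another user committed since the base version was read."""

class OptimisticResource:
    """Optimistic-locking sketch: conflicts are detected at commit time
    by comparing the committer's base version against the current one."""

    def __init__(self, data):
        self.version = 1
        self.data = data

    def checkout(self):
        # no lock taken; just remember which version you started from
        return self.version, self.data

    def commit(self, base_version, new_data):
        if base_version != self.version:
            raise ConflictError(
                "changed since version %d; merge and retry" % base_version)
        self.version += 1
        self.data = new_data
        return self.version
```

Write locking would replace the commit-time check with an explicit lock acquired at checkout and released at check-in.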
Up to this point simple off-the-shelf versioning systems like CVS and Subversion at least minimally satisfy the requirements. [Update: Stan D points out that Subversion provides support for storage object metadata as well, so this statement could be moved down below the next bullet.]
- Management of arbitrary metadata associated with storage objects. Storage objects can of course have lots of metadata associated with them, such as who created it when, where it is in some defined workflow sequence, what larger collection it is a member of, its root element type, the schema that governs it (if it's an XML document), the namespaces it reflects (if it's an XML document), and so on. A repository should provide a generic mechanism for setting and interrogating these metadata values.
Note that here I'm not talking about higher-level metadata, such as descriptors on individual XML elements, except to the degree that that metadata might appropriately characterize the entire storage object.
For example, in a DITA-aware repository you might choose to manage each topic as a separate storage object. In that case it might make sense to capture all the topic-level metadata (keywords, index terms, etc.) and reflect them as storage object metadata. Or it might not. But if you managed multiple topics in a single storage object it probably wouldn't make sense to put all the descriptors from all the topics as metadata on the single storage object that contains them.
Rest assured that I will require a higher-level metadata management facility (otherwise known as an index [see the previous post]).
This is the first requirement that cannot be satisfied by CVS or Subversion directly. There may be other generally available versioning systems that do provide an appropriate metadata feature, but I haven't looked for one yet. I did, however, implement one (see the initial post in this series). In any case, this sort of metadata management is not hard--at its simplest it's just a sequence of name/value pairs that can be easily captured in a relational database or using generic objects or whatever. It's not a hard problem.
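To show just how simple, here is a sketch using an in-memory SQLite table (the schema and the sample property names are mine, purely for illustration): one row per (storage object, name) pair.

```python
import sqlite3

# Minimal storage-object metadata sketch: one row per (object, name) pair.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE metadata (
        object_id TEXT NOT NULL,
        name      TEXT NOT NULL,
        value     TEXT,
        PRIMARY KEY (object_id, name)
    )""")

def set_metadata(object_id, name, value):
    # INSERT OR REPLACE makes the (object, name) pair a simple upsert
    conn.execute(
        "INSERT OR REPLACE INTO metadata VALUES (?, ?, ?)",
        (object_id, name, value))

def get_metadata(object_id):
    rows = conn.execute(
        "SELECT name, value FROM metadata WHERE object_id = ?", (object_id,))
    return dict(rows.fetchall())

set_metadata("chap1.xml", "root-element", "chapter")
set_metadata("chap1.xml", "workflow-state", "under review")
```

Any mechanism that can store and query name/value pairs keyed by storage object--a relational table, generic objects, even flat files--would do equally well.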
- Management of knowledge of dependency relationships among storage objects, both version-specific and version-independent.
Almost any non-trivial data will, in its semantics, express or imply dependencies with other storage objects. This is certainly true for documentation. For example, a FrameMaker file might have a reference to an external graphic, an XML document may be governed by a schema or link to another document or include a graphic in some way or include an external unparsed entity. Regardless of their specific semantics and meaning, these all define dependencies between the storage objects involved. For example, an XML document that is governed by a schema establishes a dependency between the storage object that is the XML document and the storage object that is the root storage object of the schema.
A number of operations require you to know, for a given starting storage object, what other storage objects it depends on. For example, if you want to package up all the storage objects needed in order to render a publication, you need to know what the root XML document is and what other files it depends on: schemas, graphics, XIncluded (or conreffed) documents, and so on.
You also need to be able to answer the "where used?" question. If, for example, I want to delete or significantly change a storage object I need to be able to find where it is used (what other storage objects depend on it?) in order to, for example, determine if there are no dependencies or, if there are, inform the owners of those other storage objects of the impending (or performed) change.
At its simplest, the facility can simply capture "A depends on B". In practice, it is usually useful to be able to type the dependency so that processes can make informed decisions based on different dependency types. For example, in XML, you can usefully process a document even if you don't have access to any graphics it might point to but you cannot process it if you don't have access to any external parsed entities it includes. Therefore, an export processor needs to know if a given dependency is a "parsed entity reference" dependency or a "graphic inclusion reference". For any well-defined data format there will be a set of defined dependency types and you should expect a format-aware repository to have those types built in. In addition, there will be dependency types specific to a given enterprise's or community's data. Therefore, you should expect the repository to allow you to define new types.
In addition, these dependencies should be first-class objects with their own metadata (of which the relationship type is one instance, possibly privileged).
The repository should provide facilities by which you can examine the dependencies, get their types and metadata, traverse them in either direction ("where used?"), and so forth.
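Such a facility can be sketched like this (all names hypothetical): dependencies as first-class objects carrying a type and arbitrary metadata, traversable in both directions.

```python
class Dependency:
    """A first-class dependency object: typed, with its own metadata."""
    def __init__(self, source, target, dep_type, **metadata):
        self.source = source      # the depending storage object id
        self.target = target      # the depended-on storage object id
        self.dep_type = dep_type  # e.g. "graphic inclusion reference"
        self.metadata = metadata

class DependencyIndex:
    def __init__(self):
        self._deps = []

    def add(self, source, target, dep_type, **metadata):
        self._deps.append(Dependency(source, target, dep_type, **metadata))

    def depends_on(self, source, dep_type=None):
        """What does this object depend on? (e.g., for export packaging)"""
        return [d for d in self._deps
                if d.source == source
                and (dep_type is None or d.dep_type == dep_type)]

    def where_used(self, target):
        """Who depends on this object? (e.g., before delete or major change)"""
        return [d for d in self._deps if d.target == target]
```

The type filter in depends_on is what lets an export processor, say, skip graphics it can live without while refusing to proceed without a required parsed entity.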
By "version dependent and version independent" I mean that dependencies must be able to be expressed in terms of exact version-to-version relationships ("this is a link from version 1.3 of object A to version 2.6 of object B") or expressed in terms of rules, that when applied at a particular point in time, will resolve to one version or another "this is a link from object A to the current (latest) version of object B" or "this is a link from object A to the version of object B whose metadata properties X, Y, and Z match the values S, T, and U". This is where things start to get a little tricky, but not impossibly so. But systems that only provide direct version-to-version relationships or only provide version-to-latest relationships are not, in the general case, useful, because they are much too limiting for most non-trivial use cases. There are some documentation environments where document life cycle is so short that older versions are of no interest but those environments are rare. In the typical software or consumer product documentation case there can be quite complex version-to-version dependencies and the repository needs to support those complexities.
This facility is not, by itself, link management as most people think of it. It is, however, a prerequisite for complete link management. By "link management" I mean managing the knowledge of and resolution of individual links within storage objects.
For example, say XML document A has an XInclude of some element from document B and it includes that element about 100 times (it's a warning used in every subtask in every procedure in the engine manual for the X104 personal spacecraft). That's 100 individual links of type "XInclude" but it can be captured as a single storage-object-to-storage-object dependency (object A Xincludes from Object B).
It's the core repository's job to know that A depends on B for a given reason. It's the link manager's job to know about all 100 XIncludes from A to B and to provide the services for resolving or modifying those XIncludes as needed. That problem is one that I will talk about in much more detail before too long, as it's the problem (versioned hyperdocument management) that has most engaged me for the last 15 years or so.
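The collapsing step--many element-level links, one storage-object dependency--can be sketched with ElementTree (a toy scanner; real XInclude processing also involves xpointer attributes, fallbacks, and so on):

```python
import xml.etree.ElementTree as ET

XI_INCLUDE = "{http://www.w3.org/2001/XInclude}include"

def xinclude_dependencies(xml_text):
    """Collapse the individual XInclude links in a document down to the
    set of storage-object-to-storage-object dependencies they imply."""
    root = ET.fromstring(xml_text)
    targets = set()
    for elem in root.iter(XI_INCLUDE):
        href = elem.get("href")
        if href:
            targets.add(href)  # the set does the collapsing
    return targets

doc = """<manual xmlns:xi="http://www.w3.org/2001/XInclude">
  <task><xi:include href="warning.xml"/></task>
  <task><xi:include href="warning.xml"/></task>
  <task><xi:include href="specs.xml"/></task>
</manual>"""
```

Two element-level includes of warning.xml become one dependency; a hundred would too. The link manager keeps the per-link detail; the repository only needs the set.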
But you have to walk before you can run and not everybody does a lot of linking (or can afford a full link management system). So just having basic dependency knowledge is very important and goes a long way.
Finally, storage-object-to-storage-object dependencies need not be inherent in the storage objects. They might reflect dependencies created as a side effect of some business process. For example, you might have documentation for two related products that will be released to market at the same time. This imposes a schedule dependency between the publications for those products even if there are no other relationships among them. This scheduling dependency could be captured in the repository as a dependency between the root documents of the publications involved. Given this dependency you could, for example, create a report that looks at your publication, determines what its publication date is (which could be captured as metadata on the storage object), looks for any "published with" dependencies, finds those publication roots, then traverses the tree of dependencies from those roots to see if any subcomponents are not far enough along in their development workflows. I.e., if all the components should be in the "approved" state (again, a metadata value on the storage objects) but some are still in the "under review" state, you probably want to track down the owners (another storage object metadata value) and ask what's up and when they expect to get those things approved.
You get the idea.
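For the curious, that report can be sketched in a few lines (metadata names, workflow states, and dependency types are all hypothetical):

```python
# Hypothetical repository state: per-object metadata and typed dependencies.
metadata = {
    "pubA.xml":   {"state": "approved",     "owner": "alice"},
    "pubB.xml":   {"state": "approved",     "owner": "bob"},
    "partB1.xml": {"state": "approved",     "owner": "bob"},
    "partB2.xml": {"state": "under review", "owner": "carol"},
}
dependencies = {
    "pubA.xml": [("pubB.xml", "published with")],
    "pubB.xml": [("partB1.xml", "component"), ("partB2.xml", "component")],
}

def not_ready(publication_root, required_state="approved"):
    """Walk 'published with' partner publications and their component
    trees; report components not yet in the required workflow state."""
    laggards = []
    seen = set()
    stack = [target for target, dep_type
             in dependencies.get(publication_root, [])
             if dep_type == "published with"]
    while stack:
        obj = stack.pop()
        if obj in seen:
            continue
        seen.add(obj)
        if metadata[obj]["state"] != required_state:
            laggards.append((obj, metadata[obj]["owner"]))
        stack.extend(target for target, dep_type in dependencies.get(obj, [])
                     if dep_type == "component")
    return laggards
```

Everything the report needs--states, owners, typed dependencies--is just the storage object metadata and dependency facilities already described.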
By capturing just dependencies at the storage object level you are again separating concerns (storage-object-specific metadata from semantic object (element-level) metadata) and enabling a large set of useful functions without the overhead and expense of a full link management system.
Everyone needs to know about some set of storage-object-to-storage-object dependencies. Not everyone needs link management (and even when they do they can limp along without it quite effectively for a long time--buy me a beer and I'll tell you about my first non-trivial link management exercise that was implemented entirely using line-oriented text editors and some simple markup conventions).
At this point we have the minimum features needed for complete version-aware, dependency-aware storage object management. These features are a prerequisite for more sophisticated semantic management that is applied to the content within the storage objects.
Please note a few things about the foregoing set of requirements:
- Most of it is satisfied by off-the-shelf versioning systems like CVS and Subversion
- The storage object metadata capture and access is not a hard problem and can be implemented in quite simple ways. The simple CVS and Gadfly approach I took at Woodward would work just fine for storage object metadata as well in a lot of use environments.
- For dependencies that are inherent in the semantic or syntactic content of the storage objects, it is not absolutely necessary to hold those persistently. That is, any persistent storage of knowledge of those relationships is just an optimization of what could be done by brute force: in theory, to answer any dependency question you could just examine all the storage objects to see what they point to. Of course this won't scale (though it might well perform for smaller data sets), but the point is you don't absolutely have to start with a sophisticated database to hold the dependency information.
- All the features I've talked about are not in any way XML specific. That is, a system with these characteristics can be used to manage any type of storage objects whatsoever.
- I still haven't talked about how you get the metadata or capture the dependency knowledge. That's all boundary stuff and I'll get to it soon.
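To illustrate the brute-force approach from the third point above, here's a toy "where used?" scan that treats href attributes as the only pointing mechanism (a real scanner would understand each format's actual reference semantics):

```python
import re

# Toy corpus: storage object id -> content.
objects = {
    "manual.xml": '<doc><img href="engine.png"/><xref href="specs.xml"/></doc>',
    "specs.xml":  '<doc>no outbound references</doc>',
}

HREF = re.compile(r'href="([^"]+)"')

def where_used(target):
    """Answer 'where used?' by re-scanning every object's content.
    No persisted index: correct, but O(corpus size) per question."""
    return sorted(obj for obj, content in objects.items()
                  if target in HREF.findall(content))
```

A persisted dependency index gives the same answers without the full scan; the scan is the fallback that proves you can start simple.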
With a system that does just the things outlined here you've gone a long way toward satisfying your XML (or equivalent structured data) management requirements. Of course I haven't said anything about, for example, full-text search and retrieval, but if you've been paying attention you can probably predict what I'll say about how to satisfy that requirement....
Next up: semantic processing features of a repository