Dr. Macro's XML Rants

NOTE TO TOOL OWNERS: In this blog I will occasionally make statements about products that you will take exception to. My intent is to always be factual and accurate. If I have made a statement that you consider to be incorrect or inaccurate, please bring it to my attention and, once I have verified my error, I will post the appropriate correction.

And before you get too exercised, please read the post, dated 9 Feb 2006, titled "All Tools Suck".

Friday, July 21, 2006

XCMTDMW: Characteristics of an XML CMS

I was going to talk about boundary complexity (see previous post) but I realized that before I can do that we need to have a clear understanding of what an XML CMS does (or should do).

For this discussion I will use the term "repository" to mean the overall content management system and the body of data it manages, as in "a repository of XML documents". I like "repository" because it's shorter than "XML CMS".

In addition, I feel that the term "XML CMS" is unnecessarily specialized. In my world, content management is a much more general problem and 90% of what you need to manage XML well applies to everything else too. That's another reason I chafe at over-specialized XML repositories--they really can't manage anything else. But what if you happen to have a 20-year legacy of Framemaker documents and you really don't have a need (or budget or time or patience or stomach) to convert them to XML just so you can get some decent content management features? If things are architected correctly, it shouldn't even really be an issue. But I digress.

There are lots of services and features that a repository could provide and there is of course a fuzzy boundary between features that are core repository features (such as raw data storage) and features that are really business-process-specific data processing that happens to be implemented at the repository level. That is, for a given enterprise or community of interest that needs a repository there will of course be requirements that are core repository requirements (check-in/check-out) and requirements that are unique to that enterprise or community ("on Tuesdays during the session we produce a full report of all amendments that have gone through third reading. This creates a load spike that the system must handle without significant degradation of other operations during that time"). Obviously, in this sort of general discussion I will be focusing on the former, but you have to keep the latter in mind as you design the general architecture of the system, so that the system will be able to react to those sorts of unpredictable and idiosyncratic requirements.

Another way to think about it is that a repository is a layered system, with core functionality, the functionality everybody needs, in the core components, with more and more specialized functionality built on top of those components as layers building outward. This is consistent with the separation-of-concerns/modular implementation view as well: each layer represents a more or less clear boundary between concerns that can be implemented as either a literal boundary (via an API) or as a layer of abstraction within the code. Each layer can in turn be organized into discrete components that further separate concerns within a layer. [This is pretty standard systems engineering but it still amazes me the number of commercial systems that either don't reflect this model at all (that is they are internally monolithic) or that may be implemented this way but don't usefully expose the component boundaries to the outside world. For shame.]

So what are the features that a repository needs to provide?

- Reliable, persistent versioned storage of storage objects (e.g., XML documents, Framemaker files, graphics, HTML pages, program source code, whatever).

By "storage object" I mean "file" or "resource" (in the HTTP sense) or whatever. In XML a syntactic XML document is exactly one storage object (it may make reference to other storage objects, such as external parsed entities).

In XML terms, this means the repository must manage XML documents in the strict syntactic meaning of the term "document" in the XML specification. Note the plural. Many purported XML document management systems in fact manage exactly one document, making them a "document management system". What you usually want is a "documents management system".

By "versioned" I mean "versioned in time" such that for the same logical storage object ("resource" in the terminology I normally use) you can retrieve its data state as it was at any point in time. The simplest useful versioning system only needs to provide a single sequence of versions for a given resource--that is, it does not need to support branching. [In my experience, very few people who use versioning systems, even very sophisticated people, actually use branching. It introduces any number of complexities that are best avoided. The only exception to this tends to be in code bases for products where you have to maintain multiple release versions and therefore have no choice but to do branching and merging of code. But in those cases the process is very carefully controlled and managed. In the context of document authoring, with the possible exception of legislative bill drafting systems, I have never seen either a compelling need for branching and merging or a body of authors who could reliably take advantage of it.]

Note that the other meaning of "version", that is, different alternative versions of the logical content at a single moment in time, is a linking problem (or at least it can be well modeled as a linking problem) and not a core data storage problem. For example, the management of localized documents, where you have different language versions of the same base text, is either a linking problem or an XML schema design problem, not a core storage problem. Of course some systems that are optimized for managing localized documents do address these through their core storage mechanisms. I feel strongly that that's the wrong way to go about it.

Thus I distinguish "storage object version" from "alternative version". Storage object versioning is purely data storage and is only about versions in time. Alternative versions are semantic concepts orthogonal to storage object versioning.

Note that per my previous post, a system like Subversion or CVS satisfies the requirement for reliable, persistent versioned storage of storage objects. Subversion is a little bit better because it provides support for all Unicode encodings and binary diffing, but CVS will work (as I've demonstrated).
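To make "versioned in time" concrete, here is a minimal sketch of a single, unbranched version sequence for one resource. The `Resource` class and its `commit` and `as_of` methods are hypothetical names invented for illustration, not the API of CVS, Subversion, or any other product, and timestamps are assumed to be plain comparable numbers.

```python
from bisect import bisect_right
from dataclasses import dataclass, field


@dataclass
class Resource:
    """One logical storage object with a single, unbranched sequence of versions."""

    name: str
    timestamps: list = field(default_factory=list)  # commit times, ascending
    contents: list = field(default_factory=list)    # data state at each commit

    def commit(self, when, data):
        # Versions only ever append to the end; there is no branching.
        if self.timestamps and when <= self.timestamps[-1]:
            raise ValueError("commits must move forward in time")
        self.timestamps.append(when)
        self.contents.append(data)
        return len(self.timestamps)  # 1-based version number

    def as_of(self, when):
        """Return the data state as it was at time `when` (None if not yet created)."""
        index = bisect_right(self.timestamps, when)
        return self.contents[index - 1] if index else None
```

Asking for `as_of` a time between two commits returns the earlier commit's data, which is the "data state as it was at any point in time" behavior described above.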

- Distributed access to storage object versions. That is, you have to be able to get the data back out and that access needs to be distributed in the sense that the repository and its clients do not need to be on the same physical machine or running in the same process space. There are many ways to do distribution but in today's world providing an HTTP-based access method (i.e., a Web server) is a minimum requirement. Other technologies, like CORBA, DCOM, and proprietary internet protocols (e.g., CVS's protocol, which predates HTTP by 10 years or more), can also be useful. But if it does HTTP you can pretty much play with anybody.

- Provide appropriate storage object access control such that, at a minimum, users who are not authorized to access a given storage object cannot do so. Some scenarios may further require, for example, that users not even be able to see evidence that a particular storage object exists, much less access its content or metadata. Systems like CVS and Subversion (and most operating system file systems) provide sufficient minimum access control features.

- Provide appropriate locking or conflict resolution facilities such that one user cannot destroy another user's changes. This is usually expressed as either write locking or optimistic locking. Both approaches are useful. CVS and Subversion use optimistic locking (checking for conflicts on commit) but a lot of content management systems use write locking (check out, check in). I prefer optimistic locking as a rule but it's not a religious issue, rather one driven by what will work best for the users and the business process.
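The optimistic approach in the locking bullet above can be sketched as follows: nobody takes a lock on checkout, and a commit is rejected if someone else committed first. `OptimisticStore` and `ConflictError` are hypothetical names for illustration only; real systems such as CVS and Subversion detect conflicts with considerably more machinery than a version counter.

```python
class ConflictError(Exception):
    """Raised when a commit is based on a version that is no longer the latest."""


class OptimisticStore:
    def __init__(self):
        self._versions = {}  # name -> list of contents, oldest first

    def checkout(self, name):
        """Return (base_version, content). No lock is taken."""
        history = self._versions.get(name, [])
        return len(history), (history[-1] if history else None)

    def commit(self, name, base_version, content):
        """Append a new version only if base_version is still the latest one."""
        history = self._versions.setdefault(name, [])
        if base_version != len(history):
            raise ConflictError(
                f"{name}: based on version {base_version}, latest is {len(history)}"
            )
        history.append(content)
        return len(history)
```

If two users check out the same version, the first commit succeeds and the second raises `ConflictError`, so the second user has to update and reconcile before committing. Write locking would instead refuse the second checkout.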

Up to this point simple off-the-shelf versioning systems like CVS and Subversion at least minimally satisfy the requirements. [Update: Stan D points out that Subversion provides support for storage object metadata as well, so this statement can be moved down below the next bullet.]

- Management of arbitrary metadata associated with storage objects. Storage objects can of course have lots of metadata associated with them, such as who created it when, where it is in some defined workflow sequence, what larger collection it is a member of, its root element type, the schema that governs it (if it's an XML document), the namespaces it reflects (if it's an XML document), and so on. A repository should provide a generic mechanism for setting and interrogating these metadata values.

Note that here I'm not talking about higher-level metadata, such as descriptors on individual XML elements, except to the degree that that metadata might appropriately characterize the entire storage object.

For example, in a DITA-aware repository you might choose to manage each topic as a separate storage object. In that case it might make sense to capture all the topic-level metadata (keywords, index terms, etc.) and reflect them as storage object metadata. Or it might not. But if you managed multiple topics in a single storage object it probably wouldn't make sense to put all the descriptors from all the topics as metadata on the single storage object that contains them.

Rest assured that I will require a higher-level metadata management facility (otherwise known as an index [see the previous post]).

This is the first requirement that cannot be satisfied by CVS or Subversion directly. There may be other generally available versioning systems that do provide an appropriate metadata feature but I haven't looked for one yet. I did, however, implement one (see the initial post in this series). In any case, this sort of metadata management is not hard--at its simplest it's just a sequence of name/value pairs that can be easily captured in a relational database or using generic objects or whatever.
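As a rough illustration of how little is needed, here is a sketch of name/value metadata held in a relational table using Python's standard `sqlite3` module. The table layout, the function names, and the choice of one value per name per storage object version are all assumptions made for this example, not a description of any existing system.

```python
import sqlite3


def open_metadata_store(path=":memory:"):
    """Open (or create) a store holding one row per object version and property name."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS metadata (
               object_id TEXT    NOT NULL,
               version   INTEGER NOT NULL,
               name      TEXT    NOT NULL,
               value     TEXT    NOT NULL,
               PRIMARY KEY (object_id, version, name)
           )"""
    )
    return conn


def set_metadata(conn, object_id, version, name, value):
    # Setting an existing name again replaces the old value.
    conn.execute(
        "INSERT OR REPLACE INTO metadata VALUES (?, ?, ?, ?)",
        (object_id, version, name, value),
    )
    conn.commit()


def get_metadata(conn, object_id, version):
    """Return all properties of one storage object version as a dict."""
    rows = conn.execute(
        "SELECT name, value FROM metadata WHERE object_id = ? AND version = ?",
        (object_id, version),
    )
    return dict(rows.fetchall())


def find_objects(conn, name, value):
    """Return (object_id, version) pairs whose property `name` equals `value`."""
    rows = conn.execute(
        "SELECT object_id, version FROM metadata "
        "WHERE name = ? AND value = ? ORDER BY object_id, version",
        (name, value),
    )
    return rows.fetchall()
```

Both directions matter: `get_metadata` interrogates a known storage object, while `find_objects` answers questions like "which objects are in the approved state?".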

- Management of knowledge of dependency relationships among storage objects, both version-specific and version-independent.

Almost any non-trivial data will, in its semantics, express or imply dependencies with other storage objects. This is certainly true for documentation. For example, a Framemaker file might have a reference to an external graphic, an XML document may be governed by a schema or link to another document or include a graphic in some way or include an external unparsed entity. Regardless of their specific semantics and meaning, these all define dependencies between the storage objects involved. For example, an XML document that is governed by a schema establishes a dependency between the storage object that is the XML document and the storage object that is the root storage object of the schema.

A number of operations that need to be performed require you to know, for a given starting storage object, what other storage objects it depends on. For example, if you want to package up all the storage objects needed in order to render a publication you need to know what the root XML document is and what other files it depends on: schema, graphics, XIncluded (or conreffed) documents, and so on.

You also need to be able to answer the "where used?" question. If, for example, I want to delete or significantly change a storage object I need to be able to find where it is used (what other storage objects depend on it?) in order to, for example, determine if there are no dependencies or, if there are, inform the owners of those other storage objects of the impending (or performed) change.

At its simplest, the facility can simply capture "A depends on B". In practice, it is usually useful to be able to type the dependency so that processes can make informed decisions based on different dependency types. For example, in XML, you can usefully process a document even if you don't have access to any graphics it might point to but you cannot process it if you don't have access to any external parsed entities it includes. Therefore, an export processor needs to know if a given dependency is a "parsed entity reference" dependency or a "graphic inclusion reference". For any well-defined data format there will be a set of defined dependency types and you should expect a format-aware repository to have those types built in. In addition, there will be dependency types specific to a given enterprise's or community's data. Therefore, you should expect the repository to allow you to define new types.

In addition, these dependencies should be first-class objects with their own metadata (of which the relationship type is one instance, possibly privileged).

The repository should provide facilities by which you can examine the dependencies, get their types, metadata, traverse them in either direction ("where used") and so forth.
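A minimal in-memory sketch of such a facility might look like the following. The class names, the three built-in type names, and the linear scans are assumptions chosen for brevity; a real repository would index the dependencies rather than scan a list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dependency:
    """A first-class dependency: source depends on target, for a typed reason."""

    source: str
    target: str
    dep_type: str
    metadata: tuple = ()  # extra (name, value) pairs carried by the dependency


class DependencyGraph:
    def __init__(self, known_types=("governed-by-schema",
                                    "parsed-entity-reference",
                                    "graphic-inclusion-reference")):
        self.types = set(known_types)  # format-defined types, built in
        self._deps = []

    def define_type(self, dep_type):
        """Let an enterprise or community add its own dependency types."""
        self.types.add(dep_type)

    def add(self, source, target, dep_type, **metadata):
        if dep_type not in self.types:
            raise ValueError(f"unknown dependency type: {dep_type}")
        dep = Dependency(source, target, dep_type, tuple(sorted(metadata.items())))
        self._deps.append(dep)
        return dep

    def depends_on(self, source, dep_type=None):
        """What does `source` depend on (optionally only one type)?"""
        return [d for d in self._deps
                if d.source == source and dep_type in (None, d.dep_type)]

    def where_used(self, target, dep_type=None):
        """What depends on `target` (optionally only one type)?"""
        return [d for d in self._deps
                if d.target == target and dep_type in (None, d.dep_type)]
```

An export processor could call `depends_on(doc, "parsed-entity-reference")` to find what it must ship, and treat graphic dependencies as optional.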

By "version-specific and version-independent" I mean that dependencies must be able to be expressed in terms of exact version-to-version relationships ("this is a link from version 1.3 of object A to version 2.6 of object B") or expressed in terms of rules that, when applied at a particular point in time, will resolve to one version or another ("this is a link from object A to the current (latest) version of object B" or "this is a link from object A to the version of object B whose metadata properties X, Y, and Z match the values S, T, and U"). This is where things start to get a little tricky, but not impossibly so. But systems that only provide direct version-to-version relationships or only provide version-to-latest relationships are not, in the general case, useful, because they are much too limiting for most non-trivial use cases. There are some documentation environments where document life cycle is so short that older versions are of no interest but those environments are rare. In the typical software or consumer product documentation case there can be quite complex version-to-version dependencies and the repository needs to support those complexities.
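The three kinds of dependency target described above can be sketched as resolution rules that are applied to whatever versions exist at the moment of resolution. The names `exact`, `latest`, and `matching` are hypothetical, and picking the newest version when several match the metadata is an assumption of this sketch, not something the requirement dictates.

```python
from dataclasses import dataclass, field


@dataclass
class Version:
    number: int
    metadata: dict = field(default_factory=dict)


def exact(number):
    """Version-specific rule: always resolve to one fixed version."""
    return lambda versions: next((v for v in versions if v.number == number), None)


def latest():
    """Version-independent rule: resolve to whatever is newest right now."""
    return lambda versions: max(versions, key=lambda v: v.number, default=None)


def matching(**required):
    """Version-independent rule: newest version whose metadata matches every value."""
    def rule(versions):
        hits = [v for v in versions
                if all(v.metadata.get(k) == val for k, val in required.items())]
        return max(hits, key=lambda v: v.number, default=None)
    return rule
```

A rule is stored with the dependency and called with the target's current version list, so `exact(...)` keeps returning the same version while `latest()` and `matching(...)` can change their answer as new versions are committed.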

This facility is not, by itself, link management as most people think of it. It is, however, a prerequisite for complete link management. By "link management" I mean managing the knowledge of and resolution of individual links within storage objects.

For example, say XML document A has an XInclude of some element from document B and it includes that element about 100 times (it's a warning used in every subtask in every procedure in the engine manual for the X104 personal spacecraft). That's 100 individual links of type "XInclude" but it can be captured as a single storage-object-to-storage-object dependency (object A XIncludes from object B).

It's the core repository's job to know that A depends on B for a given reason. It's the link manager's job to know about all 100 XIncludes from A to B and to provide the services for resolving or modifying those XIncludes as needed. That problem is one that I will talk about in much more detail before too long, as it's the problem (versioned hyperdocument management) that has most engaged me for the last 15 years or so.
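The difference between the two levels can be sketched with Python's standard `xml.etree.ElementTree`. The function names are hypothetical; the only external fact relied on is the XInclude namespace URI. The first function sees every individual link (the link manager's view) and the second collapses them to one entry per target storage object (the core repository's view).

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Element name of xi:include in ElementTree's {namespace}local form.
XINCLUDE = "{http://www.w3.org/2001/XInclude}include"


def xinclude_links(xml_text):
    """Return every XInclude in the document as an (href, xpointer) pair."""
    root = ET.fromstring(xml_text)
    return [(el.get("href"), el.get("xpointer")) for el in root.iter(XINCLUDE)]


def storage_dependencies(links):
    """Collapse individual links into one count per target storage object."""
    return Counter(href for href, _ in links if href)
```

For a document with 100 includes of the same warning, `xinclude_links` returns 100 pairs while `storage_dependencies` reports a single target with a count of 100. Includes without an `href` (same-document references) are ignored here because they create no storage-object dependency.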

But you have to walk before you can run and not everybody does a lot of linking (or can afford a full link management system). So just having basic dependency knowledge is very important and goes a long way.

Finally, storage-object-to-storage-object dependencies need not be inherent in the storage objects. They might reflect dependencies created as a side effect of some business process. For example, you might have documentation for two related products that will be released to market at the same time. This imposes a schedule dependency between the publications for those products even if there are no other relationships among them. This scheduling dependency could be captured in the repository as a dependency between the root documents of the publications involved. Given this dependency you could, for example, create a report that looks at your publication, determines what its publication date is (which could be captured as metadata on the storage object), looks for any "published with" dependencies, finds those publication roots, then traverses the tree of dependencies from those roots to see if any subcomponents are not far enough along in their development workflows. I.e., if all the components should be in the "approved" state (again, a metadata value on the storage objects) but some are still in the "under review" state, you probably want to track down the owners (another storage object metadata value) and ask what's up and when do they expect to get those things approved?

You get the idea.
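For the curious, the report described above might be sketched like this. It assumes dependencies are available as simple `(source, target, type)` triples and metadata as a dict of dicts. The type name `"published-with"`, the `"state"` and `"owner"` property names, and the function name are illustrative assumptions, and the publication-date check is left out to keep the sketch short.

```python
def unready_components(root, dependencies, metadata, required_state="approved"):
    """Return {object: owner} for every component that is not in required_state.

    dependencies: iterable of (source, target, dep_type) triples
    metadata:     dict mapping object name -> dict of property values
    """
    dependencies = list(dependencies)
    # Start from this publication plus everything it must be published with.
    roots = [root] + [t for (s, t, kind) in dependencies
                      if s == root and kind == "published-with"]
    seen, stack, late = set(), list(roots), {}
    while stack:
        obj = stack.pop()
        if obj in seen:
            continue  # shared components are only checked once
        seen.add(obj)
        info = metadata.get(obj, {})
        if info.get("state") != required_state:
            late[obj] = info.get("owner")
        # Follow content dependencies only, not further schedule dependencies.
        stack.extend(t for (s, t, kind) in dependencies
                     if s == obj and kind != "published-with")
    return late
```

The result maps each lagging component to its owner, which is exactly the list of people to go ask "what's up?".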

By capturing just dependencies at the storage object level you are again separating concerns (storage-object-specific metadata from semantic object (element-level) metadata) and enabling a large set of useful functions without the overhead and expense of a full link management system.

Everyone needs to know about some set of storage-object-to-storage-object dependencies. Not everyone needs link management (and even when they do they can limp along without it quite effectively for a long time--buy me a beer and I'll tell you about my first non-trivial link management exercise that was implemented entirely using line-oriented text editors and some simple markup conventions).

At this point we have the minimum features needed for complete version-aware, dependency-aware storage object management. These features are a prerequisite for more sophisticated semantic management that is applied to the content within the storage objects.

Please note a few things about the foregoing set of requirements:

- Most of it is satisfied by off-the-shelf versioning systems like CVS and Subversion

- The storage object metadata capture and access is not a hard problem and can be implemented in quite simple ways. The simple CVS and Gadfly approach I took at Woodward would work just fine for storage object metadata as well in a lot of use environments.

- For dependencies that are inherent in the semantic or syntactic content of the storage objects, it is not absolutely necessary to hold those persistently. That is, any persistent storage of knowledge of those relationships is just optimization of what could be done by brute force. That is, in theory, to answer any dependency question you could just examine all the storage objects to see what they point to. Of course this won't scale (but it might well perform for smaller data sets) but the point is you don't absolutely have to start with a sophisticated database to hold the dependency information.

- All the features I've talked about are not in any way XML specific. That is, a system with these characteristics can be used to manage any type of storage objects whatsoever.

- I still haven't talked about how you get the metadata or capture the dependency knowledge. That's all boundary stuff and I'll get to it soon.
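To illustrate the brute-force point in the list above: with no persistent dependency store at all, "where used?" can be answered by scanning the text of every storage object. This sketch assumes references show up as `href="..."` or `src="..."` attribute values, which is a simplification made up for the example. A format-aware scanner would parse each object properly instead of using a regular expression.

```python
import re

# Assumption: a reference is the part of an href/src value before any "#fragment".
REFERENCE = re.compile(r'(?:href|src)\s*=\s*"([^"#]+)')


def where_used(target, objects):
    """Scan every storage object and return the names of those that refer to target.

    objects: dict mapping storage object name -> its text content
    """
    return sorted(name for name, text in objects.items()
                  if target in REFERENCE.findall(text))
```

Every call reads every object, so the cost grows with the size of the repository, which is why persistent dependency storage is an optimization worth having once the data set gets large.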

With a system that does just the things outlined here you've gone a long way toward satisfying your XML (or equivalent structured data) management requirements. Of course I haven't said anything about, for example, full-text search and retrieval, but if you've been paying attention you can probably predict what I'll say about how to satisfy that requirement....

Next up: semantic processing features of a repository

8 Comments:

Anonymous Anonymous said...

Eliot,

You say that management of arbitrary metadata for storage objects is a requirement that cannot be accommodated by CVS or Subversion. I think that is true for CVS, but what about Subversion properties? Why do they not satisfy this requirement?

Interesting stuff.

11:29 AM  
Blogger John Cowan said...

The reason that you don't see any need for more than simple time-oriented versioning is that branching and merging is indeed too obnoxious, not because simple versioning is really adequate.

In particular, it's totally inadequate for situations where people work on the same document offline in an attempt to converge on a final result. Lawyers negotiating a contract are a good example: my lawyer sends his draft to your lawyer, who makes changes and sends them back, etc. etc.

The trouble is, of course, that the changes have to be batched and the protocol strictly ping-pong in order to avoid hopeless confusion. If my lawyer finds a totally new change that he thinks is needed, he simply has to wait until your lawyer sends back the new version, even if the change is in a totally separate part of the document.

The solution, then, is to promote change-oriented revision control systems, where instead of versions there are changes, and the question is, Which changes are applied to produce what? For any given subset of the available set of changes, there is a deterministic procedure to apply those changes and no others.

Darcs is a simple example of such a system, though its view of what counts as a change is too weak IMHO to be useful for full-fledged documents as opposed to code.

Fortunately, David Durand has already done the heavy lifting for figuring out how to reconcile more general types of changes (dynamic move and dynamic copy as well as insert and delete) in his Palimpsest model, which only needs a good implementation framework.

FWIW, using change-oriented revision control also eliminates the artificial distinction between revisions and alternative versions; if we admit transformations as a type of change (which is a straightforward extension of Palimpsest), then having an alternative version is a matter of applying or not applying a particular transformational change.

11:41 AM  
Blogger Eliot Kimber said...

I wasn't aware of that feature of Subversion. I must admit that what I know of Subversion is that it is a good replacement for CVS that offers more features without completely breaking the paradigm. I haven't actually looked into it deeply or actually used it in practice.

So it's quite likely that I've understated its capabilities. I will investigate as soon as I can and revise my statements as appropriate.

1:04 PM  
Anonymous Anonymous said...

Properties were one of the big selling points of Subversion for me. In short, it allows you to associate an arbitrary set of key-value pairs to any file or directory in your version tree.

Properties are themselves versioned and values can take either text or binary form.

The Subversion system itself takes advantage of properties (via reserved property names) to do things like assign a MIME-type to a file or designate a file for automatic keyword expansion.

-Stan

3:38 PM  
Anonymous Anonymous said...

Eliot:

Good stuff as usual. I blogged about my post at my site, www.thecontentwrangler.com today. Thanks for beginning to demystify an important topic.

6:20 AM  
Blogger Eliot Kimber said...

In response to John's comments, I will say, in advance of taking more time to think about them more deeply, that I probably agree with them. I gave a paper at the first HyTime conference on using hyperlinks to represent changes between versions so it seems likely that David Durand's thinking is consistent with that thinking.

My focus on the use of tools like CVS and Subversion is simply one of economy and convenience.

The issue of management and reconciliation of offline changes is another serious problem. It's something we thought about a lot in developing the Bonnell system and the SnapCM model but we never had the chance to put our ideas to the test.

I think one hope is that by the time the issue needs to be solved generally that network infrastructure will be so ubiquitous that it won't be an issue any more. We're pretty close today, what with free WiFi and mobile data services and whatnot, but we're not quite there.

This is a problem for which, while I can imagine sophisticated technical solutions, I'm hoping that by procrastinating it will go away.

I will also say that I've come to believe pretty strongly in the Extreme Programming principle that there's no substitute for clear and constant human communication. A lot of technical solutions around content management (locking, merging, workflow, etc.) are really attempts to find technological solutions to what are essentially human communication issues. Something as simple as asking your teammate "hey, are you working on this file? 'Cause I need to modify something" can go a long way.

8:34 AM  
Anonymous Anonymous said...

Your post is an excellent overview of all the complications one has to think about, I very much enjoyed reading it. I wanted to point out, in case you weren't already aware, that the newer versions of Subversion now support exclusive locking:
http://svnbook.red-bean.com/nightly/en/svn.ref.svn.c.lock.html

9:15 PM  
Blogger Eliot Kimber said...

Yes, I should have said that CVS and Subversion use optimistic locking by default. Both CVS and Subversion allow you to lock files, although I've never used that feature myself.

9:50 AM  
