
NOTE TO TOOL OWNERS: In this blog I will occasionally make statements about products that you will take exception to. My intent is to always be factual and accurate. If I have made a statement that you consider to be incorrect or inaccurate, please bring it to my attention and, once I have verified my error, I will post the appropriate correction.

And before you get too exercised, please read the post, dated 9 Feb 2006, titled "All Tools Suck".

Monday, July 31, 2006

XCMTDMW: Element to Element Linking: Overview

What do we mean by "linking" in the context of XML document processing?

The most general definition is "a semantic object that establishes a set of one or more relationships among uniquely-addressable XML components". This definition is reflected by the XLink and HyTime standards, which provide syntax and semantics for establishing arbitrarily-complex relationships between arbitrarily-addressable things. (XLink is limited to the domain of linking among components for which a URL-compatible fragment addressing syntax and semantic have been defined; HyTime provides generic facilities for making anything generically addressable and therefore enables linking anything to anything via a single standard representation mechanism (groves).) [See the comments to this post for more good discussion around my original misstatement of XLink's limitations. The distinction between XLink and HyTime in this area is subtle but important: HyTime provides a generic mechanism by which you can define the "in-memory" addressing representation of anything (the downside is somebody has to define it and somebody has to implement the instantiation of the representation). By contrast, XLink depends on the individual data formats (XML, HTML, CGM, MS Word, whatever) defining, as IETF or W3C specifications, what their addressable components are and what the syntax for addressing them is. If there is no such specification, XLink can't link it. This is one reason it often seems like a good idea to use XML to encode everything: it makes it universally addressable. That's true, but it's not the only way it could be done. The Web world could easily define the functional equivalent of HyTime groves, and XLink/XPointer could then be defined in terms of addresses into that generic grove-like thing. But I don't see that happening any time soon. In addition, XLink addressing is done via URIs exclusively, which should not be a limitation in practice but is another difference--HyTime is so flexible that almost any reasonable way of writing down addresses using SGML or XML markup can be made recognizable to a HyTime processor--a degree of flexibility that made HyTime difficult for many people to understand, but I digress.]

For example, using XLink you could semantically link words in one section of PCDATA to words in another section just as easily as you can link one element to another.
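For instance, a word-to-word link like that might be sketched as an XLink extended link using the xpointer() scheme's string-range() function. The element names here are invented for the illustration, the namespace declarations are omitted, and string-range() never advanced past Working Draft status, so treat the details as approximate:

<related-terms xlink:type="extended">
<term xlink:type="locator" xlink:label="usage" xlink:href="chapter_01.xml#xpointer(string-range(//p[2],'governor'))"/>
<term xlink:type="locator" xlink:label="definition" xlink:href="glossary.xml#xpointer(string-range(//dd[5],'governor'))"/>
<go xlink:type="arc" xlink:from="usage" xlink:to="definition"/>
</related-terms>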

While that level of generality is sometimes useful, and it's important for standards like XLink, HyTime, and XPointer to both enable it and standardize it clearly and completely, for the purposes both of discussing the issues inherent in linking and of doing workaday technical documentation we can narrow our focus to a key subset of the general case: linking one element to another element or set of elements.

First, let me explain my repeated stress on the word "semantic". A link is a semantic relationship whose meaning is independent of how the relationship is established. Think of it like marriage: it doesn't matter whether you're married in a church or by a justice of the peace, in Austin or Amsterdam, in English or in Chinese, the resulting relationship is the same: A is married to B.

By the same token it doesn't matter how a link is expressed syntactically in your data: XLink, XInclude, HyTime, HTML, your own 20-year-old link markup--the relationships will be the same: Element A is linked to Element B for some reason.

By the same token, addressing, on which semantic linking depends, is entirely syntactic. Addressing is the plumbing or mechanics that let you physically connect things together: the pointers. The addressing syntax you use has many practical implications, including the availability of implementations, the cost of implementation and processing, the opportunities for interoperation, and so on, but the specific syntax you use doesn't affect the meaning of the relationships established by the links that do the addressing.

That is, it doesn't matter whether you arrive at your wedding via train or car or pontoon boat, the end result is the same. The cost and speed and availability are different but as long as you get there on time, it doesn't matter which one you use.

This is very important. Clear thinking about linking requires that you be able to make a complete and clear distinction between the syntax-independent and syntax-specific parts of linking.

And clear thinking is the only way we will be able to find our way safely to a generalized approach to XML document and link management that can both satisfy all our requirements and not require crazy optimizations.

My promise to you is that if you stick with me in this exploration of the intricacies and pitfalls of linking, when we come out the other end you will have at your disposal a general architectural approach to link management that can be implemented as simply or as sophisticatedly as you require and that will do everything you need it to do at a cost exactly proportional to your specific needs in terms of scale, performance, and completeness. That is, if you need to do simple element-to-element linking there is a simple solution that is completely compatible with and upgradable to the most sophisticated system that lets you link anything to anything. We already saw this with the Woodward Governor system. You do not need a hugely-expensive, overoptimized XML-aware CMS as the cost of entry to doing sophisticated linking. You may need one eventually if your scale and performance requirements are high. But it is likely that in fact you need something less daunting and expensive.

OK, back to linking.

Here are some facts that help us narrow our problem statement while preserving our ability to do more sophisticated things in the future:

- You can always change the addressing syntax without changing the semantics of the links. For example, you can change from using only ID references to using full XPointers without changing the meaning of any links as long as the addressed result is the same.

- Any link expressed as an element that points directly to another element (an "inline link") can be replaced with an "out-of-line" link that points to the two original elements without changing the semantics of the relationship expressed. The reverse is also true. (See the example following this list.)

- Doing one-to-many linking or many-to-many linking is no harder than doing one-to-one linking in a generalized system. It mostly becomes a user interface issue (which is why HTML doesn't directly allow it).

- The nature of the things linked doesn't change the general nature of the issues inherent in doing semantic linking and addressing.

- Most of the complexity of link processing and management is in the addressing.

- Most of the complexity of addressing comes from managing addresses within a body of information under revision. If your data is static and unchanging, addressing is easy, just a simple matter of programming. It's when your data changes over time that things get interesting.

- The core requirements for linking and addressing in authoring support repositories and in delivery support repositories are fundamentally different. In particular, authoring repositories must provide sophisticated mechanisms for doing indirect addressing while delivery repositories need not do any indirection and would rather not do any (in order to keep things as simple and quick as possible).
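To make the inline/out-of-line point above concrete, here is a sketch of the same cross-reference expressed both ways. The element names are invented for the illustration and namespace declarations are omitted; only the xlink: attributes follow a standard:

<!-- Inline: the linking element is itself one end of the link -->
<xref href="parts_catalog.xml#fan-belt">the fan belt entry</xref>

<!-- Out-of-line: a separate linking element points to both ends -->
<link xlink:type="extended">
<endpoint xlink:type="locator" xlink:label="source" xlink:href="maintenance_guide.xml#step-12"/>
<endpoint xlink:type="locator" xlink:label="target" xlink:href="parts_catalog.xml#fan-belt"/>
<connect xlink:type="arc" xlink:from="source" xlink:to="target"/>
</link>

In both cases the semantics are identical: the step-12 element is related to the fan-belt element. All that changes is where the relationship is written down.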

Taken together, these facts mean the following for us:

- We can focus on the simplest case, element-to-element links and know that the same issues and principles will apply to less common cases, such as element-to-text links.

- Our choice of addressing method will be the primary determiner of the cost of our system in terms of both cost to implement and cost to use.

I will also observe that in most technical documentation linking is limited to element-to-element links and usually to strictly binary links of one element to one element, for the simple reason that doing anything more sophisticated is challenging for writers from a rhetorical standpoint and is complicated by the inherent lifecycle management challenges posed by linking in general. That is, doing more than simple links is just too hard in most cases.

[NOTE: In examples that follow I will omit namespace declarations just to keep the examples simple but my policy is that all elements should be in a namespace other than the no-namespace namespace. Just so we're clear.]

OK, let's pull the covers off a link and see what makes it tick. Let's start with one we've seen, an XInclude link:
<?xml version="1.0"?>
<doc>
...
<xi:include href="../common/warnings/dont_run_scissors.xml"/>
...
</doc>
Here we have a simple XInclude "include" link. This link is establishing a relationship between itself, the <xi:include> element, and the element that is the document element of the XML document named by the href= attribute. The semantics of the relationship are defined by the XInclude specification and are "transclude" or "use-by-reference".

Note that this is not a link between the xi:include element and the document entity "dont_run_scissors.xml". It is also not a link between the document that contains the xi:include element and the document entity. It is a link from one element, the xi:include element, to another element, the document element of the document entity named. This is very important and if you aren't seeing the distinction we need to stop now and make sure you do see it because this is crucial to our understanding going forward. To make it clearer, let's look at dont_run_scissors.xml:
<?xml version="1.0"?>
<warning>
<p>Don't run with scissors.</p>
</warning>
The relationship established by the xi:include element is between itself and the <warning> element that happens to be the document element of dont_run_scissors.xml.

Why is this? It's because XInclude defines a useful shortcut which is that, by definition (not just by convention), a reference to a document entity with no explicit XPointer is a reference to that document entity's document element.

Let's make this clear by changing our data a bit. Let's aggregate all our standard warnings into a single document for convenience:
<?xml version="1.0"?>
<warning_set>
<warning>
<p>Don't run with scissors.</p>
</warning>
<warning>
<p>Don't stand on the top rung of a step ladder.</p>
</warning>
</warning_set>
Now let's create a new version of our linking document to reflect this new organization of warnings [NOTE: I'm pretty sure my xpointer syntax is not complete. I'm keeping it simple for example purposes. See the spec for the exactly correct syntax]:
<?xml version="1.0"?>
<doc>
...
<xi:include href="../common/warnings/warnings.xml"
xpointer="xpointer(/*/warning[1])"
/>
...
</doc>
What have we changed? Because the warning we want is no longer a document element (it's no longer the root element of its containing document), we can't use just an href=--we have to add an xpointer= in order to address the element we want. So we've added an xpointer= attribute with an XPointer that addresses the first warning in the new warning_set document.

The relationship is still the same: the xi:include element is pointing to the don't run with scissors warning. The addressing has changed (because the data changed) but the semantics are the same and the processing result will be the same.

And note that it doesn't matter how we address the target warning. Here I made the smallest possible change to the warning data (added a wrapper warning_set element) but I didn't change the target warning at all. In particular I didn't do what a lot of people would either assume is required or do instinctively: add an ID to the warning.

This is to make the point that how you do addressing doesn't frickin' matter as regards the semantics of the links. The only questions are "how hard is it to create the pointer in the first place and how hard will it be to resolve?" As it happens, with XPointer, most of it is pretty easy and you can do it in XSLT 1 (and it's really easy with XSLT 2). I've done it and I make that XSLT code freely available (I believe an older version is somewhere on the XSL FAQ site--I have a newer version that supports XSLT 2 but I need to post it somewhere). In any case, it's not that hard and it gets easier every day.
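Just to give a flavor of how little machinery the basic case needs, here's a quick XSLT 2 sketch (not the code I just mentioned, and deliberately simplified) that resolves only the XPointer element() scheme's child-sequence form, e.g. "element(/1/3/2)"--it ignores shorthand IDs and the xpointer() scheme entirely:

<xsl:stylesheet version="2.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:xs="http://www.w3.org/2001/XMLSchema"
  xmlns:local="urn:example:xpointer-demo">

  <!-- Resolve a child-sequence pointer such as element(/1/3/2) against a document. -->
  <xsl:function name="local:resolve-element-pointer" as="node()?">
    <xsl:param name="doc" as="document-node()"/>
    <xsl:param name="pointer" as="xs:string"/>
    <xsl:variable name="steps" as="xs:string*"
      select="tokenize(replace($pointer, '^element\((.*)\)$', '$1'), '/')[. ne '']"/>
    <xsl:sequence select="local:walk($doc, $steps)"/>
  </xsl:function>

  <!-- Walk down the tree one numbered child element at a time. -->
  <xsl:function name="local:walk" as="node()?">
    <xsl:param name="context" as="node()?"/>
    <xsl:param name="steps" as="xs:string*"/>
    <xsl:sequence select="
      if (empty($context) or empty($steps))
      then $context
      else local:walk($context/*[xs:integer($steps[1])], subsequence($steps, 2))"/>
  </xsl:function>

</xsl:stylesheet>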

Have I made my point about addressing vs semantics? I hope so because it's crucial to making everything work. In particular, if you can't change the form of address without changing the meaning of your links, link management would be very hard indeed.

Having said that, it's also the case that the form of address you choose will affect many practical aspects of the system. In particular, if you choose a form of address that is not standards based (that is, is not XPointer or some form of schema-defined key/keyref) then you are at a minimum increasing the cost of implementation because you'll be on the hook for all the code components that have to work with those addresses (both to create them in new documents and to resolve them during processing). If the addressing mechanism is specific to a product (for example, references to object IDs in some proprietary repository) then you've tied yourself to that repository at the data level which I think is a very dangerous thing to do and should only be done when there is no alternative (and there's always an alternative).

Note too that if your address is to object IDs in a repository you are doing exactly what we did above when we used just the href= to point to dont_run_scissors.xml: you're addressing a storage object in order to address its root element. That is, any system that decomposes documents at the element level is making individual documents out of each of those elements. That's not necessarily bad (and we'll see later where having the ability to do that as needed is a good thing) but let's not pretend that you are addressing elements directly. You are not. A lot of the incorrect behavior of these systems (such as synthesizing invalid documents on export) comes from not realizing or admitting that their objects are documents and not elements in some element tree reflecting a single document (which is what they usually claim or the appearance they expose through UIs and APIs). Just saying.

OK, let's look at what we've done and what we've got so far:

- We started with a very simple link, an XInclude from inside one document to an element in another document. Our intent was to relate the xi:include element to a single warning element and we did that by pointing to the document entity that contained the warning element and for which the warning element was the root element.

In terms of our storage-management framework, this created a system of two documents with a dependency between the first document and the warning document of type "component of".

- We decided to put all our warnings into one document (for example, because they all go through a single approval workflow and must all be approved by the same deadline or because they're created and managed by one author). This required us to create a new document, warnings.xml. Into this document we copied the original warning from dont_run_scissors.xml as well as other warnings. We committed this new document into our system.

- By some means as yet unrevealed, we, the authors of the original document, came to know that the authoritative version of our warning is now in warnings.xml and that we need to create a new version of our document that reflects this new location. So we checked out our doc (let's call it doc_01.xml), added the necessary xpointer= attribute to the xi:include element, and committed this new version into the repository.

There's some interesting stuff going on here that I need to point out:

- The original version of doc_01.xml continues to irrevocably point to the original warning in dont_run_scissors.xml. The creation of warnings.xml did not change anything about this. If you were to process version 1 of doc_01.xml right now you would get the same result you got before we created warnings.xml--that is, the warning we would use would be the one in dont_run_scissors.xml, not the one in warnings.xml.

- There are two versions of the don't run with scissors warning that we, as humans doing this work, know are versions in time. However, the information we have seen so far does not explicitly relate the two versions in any way and only weakly implies it through the two versions of doc_01.xml, which differ only in the form of address used for the xi:include (but note that could be because we decided to use an entirely different warning--there's nothing about the link that says we were linking to a new version of the same warning resource (in SnapCM terms)). And note that making each element its own document wouldn't help us here because the whole point was we wanted all the warnings in one document. If we want that reasonable level of storage organization flexibility then we have to step up to being able both to address elements that are not document elements and to provide some way of tracking the version history of elements regardless of their storage locations. Fortunately it's not too hard to do.

- The change to the warning, in this case a change to its physical location, required us to react by creating a new version of our document doc_01.xml even though the content of the warning itself did not change and therefore we had no other reason to change doc_01.xml. This is very important. This is the essential problem in the management of versioned hyperdocuments. Think about the implications here for a large body of documents all of which use this standard warning.

From this simple use case, which is pretty much the simplest use case, you should start to see a few things with some clarity:

- Moving from addressing elements indirectly via reference to the XML documents of which they are the root to addressing elements anywhere inside their containing documents complicates things a good bit (mostly for address creation, which really means for authoring user interfaces).

- There is a need to track the version history of elements, not just storage objects. It would really be nice to know where our warning, as a unit of managed information in a non-trivial workflow, has been over its lifetime.

I picked warnings on purpose because they are the most obvious example of information for which there could be severe legal and safety implications and for which you therefore need to know what you said when and where you said it and which documents used which version at what time in the past. That is, when ScissorCo gets sued you need to be able to prove that your authors used the right warning in the right documents and therefore the plaintiff should have known not to run with them. I also chose warnings because they are an obvious target of re-use and they tend to go through an authoring and revision workflow separate from any documents that use them. Keep that in mind as we go forward. Warnings are just an obvious instance of a more general use-by-reference case: using information across publications or data sets with different workflows that have no necessary or natural synchronization. For example, where core content is developed on a per-engine basis but is used in publications whose workflow schedule is driven by specific product development and release cycles.

- During authoring (that is, during the revision life cycle of the information) there is a strong requirement for various forms of indirect addressing in order to avoid the very problem we ran into here: change to a link target requires changing the link source even though the semantics of the link were otherwise not affected.

The SnapCM model provides one form of indirect addressing, the dependency link, but that alone is not sufficient if we want to enable direct addressing of elements regardless of how they are stored (because SnapCM dependencies are only between storage objects). If your requirements can be met by only doing linking and addressing of document root elements then it is sufficient (although the implication is sometimes that you end up with a lot of very small documents). But it's not that hard to step up to doing indirect addressing of elements anywhere.

Finally, I'll leave you with one question: what W3C or OASIS or IETF standard provides a mechanism for doing indirect addressing of XML elements that are not document elements? [I left out ISO because we already know the answer: HyTime (ISO/IEC 10744:1996).]

Next time: Why indirection is so important for authoring


Sunday, July 30, 2006

XIRUSS-T Update

I have updated the XIRUSS-T source code so that the latest code in the Subversion repository on SourceForge works correctly (the URL processing had been broken). The code that's there includes the beginnings of support for HTTP PUT and POST methods by which you can modify and add to the repository remotely. Using the code that's there, you can, for example, use an interactive Jython session and the XirussHttpClientHelper class to add things to the repository. Not that that's any sort of real client user interface but it does demonstrate that XIRUSS-T is moving from being write-once read-many to fully read/write.

I've also started implementing a simple REST API that will make it easier to implement clients. I've got the code framework in place (I've implemented returning a list of all the branches in the repository) and implementing the rest of the operations shouldn't take too long--it's mostly typing and working out what the best URL syntax should be.
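Just to give a flavor of the kind of thing I mean--and this is strictly a sketch, since the URL syntax is exactly the part I'm still working out, so don't treat it as the real XIRUSS-T API:

GET  /repository/branches                -- list all branches (the part that works today)
GET  /repository/branches/{branchId}     -- one branch and the snapshots on it
GET  /repository/resources/{resourceId}  -- resource metadata plus its list of versions
GET  /repository/versions/{versionId}    -- version metadata (the stored bytes via a sub-path)
POST /repository/versions                -- create a new version from the request body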

I also plan to implement some sort of minimal graphic client UI that will let you import things, set version properties, create dependencies, commit snapshots, create branches, and so on. Unfortunately, I'm not much of a UI programmer so I don't know how much I'll really be able to do quickly.


Friday, July 28, 2006

XCMTDMW: Import is Everything, Part 4

OK, back to our import use cases.

In Part 2 we left off after having imported the source XML for a publication (doc_01.xml) and its schema (book.xsd) and then having imported a second version of doc_01.xml without importing an unnecessary second version of the schema (because we were able to tell, through the intelligence about XSD schemas in our importer, that we already had the right schema instance in the repository).

We saw that the dependency relationships let us dynamically control which version or versions a link will resolve to at a particular point in time by changing the "resolution policy" of the dependency. This allowed us to import a new version of the schema without automatically making old versions invalid. It also gave us the choice of making some versions invalid or not as best reflected our local business policies.

Note that the storage object repository (Layer 1) doesn't care whether or not the XML documents are valid--it's just storing a sequence of bytes. It's the business processes and processors doing useful work that care or don't. This is why we can put the repository into a state where we know some of the documents in it are not schema valid.

Also, it should be clear that whether you allow import of invalid (or even non-well-formed) documents is entirely a matter of policy enforced by the importer. For example, you could say "if it's not schema valid it's not getting in" or you could simply capture the validity state in metadata, as we've done here. You could have a process that will import everything, no matter what, but if, for example, an imported XML document is not well-formed, it will import it as a simple text file with a MIME type of "text" not "application/xml". It's up to you. If your CMS doesn't give you this choice out of the box you've given up a lot of your right to choose your policies.
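To make the "it's up to you" point concrete, here's a little Java sketch of what such a policy hook might look like. The type names are invented for the example--this is not the XIRUSS-T importer API, just the shape of the idea:

// Hypothetical types, for illustration only.
enum XmlState { SCHEMA_VALID, WELL_FORMED_ONLY, NOT_WELL_FORMED }

enum Decision { STORE_AS_XML, STORE_AS_TEXT, REJECT }

interface ImportPolicy {
    /** Decide how (or whether) to store a candidate document given its parse state. */
    Decision decide(XmlState state);
}

// The lenient policy described above: let everything in, capture validity as metadata,
// but store documents that aren't even well-formed as plain text rather than XML.
class LenientImportPolicy implements ImportPolicy {
    public Decision decide(XmlState state) {
        return state == XmlState.NOT_WELL_FORMED ? Decision.STORE_AS_TEXT : Decision.STORE_AS_XML;
    }
}

// The strict alternative: "if it's not schema valid it's not getting in."
class StrictImportPolicy implements ImportPolicy {
    public Decision decide(XmlState state) {
        return state == XmlState.SCHEMA_VALID ? Decision.STORE_AS_XML : Decision.REJECT;
    }
}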

In our repository we have the "is schema valid" property which will either be "true" or "false" (it could also be "don't know", for example, if you imported a document that referenced a schema you don't have or that is in a namespace for which you have no registered schema).

Now imagine that we've built a Layer 3 Rendition Server that manages the general task of rendering publication source documents into some output, such as PDF or HTML. It's pretty likely that there's no point in rendering documents that are known to not be schema valid. With our "is schema valid" property the rendition server can quickly look to see if all the components of a publication are valid before it does any processing, which would be a big time saver.
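Here's a sketch of that preflight check, again written against invented interfaces rather than any real repository API:

import java.util.List;

// Hypothetical stand-in for whatever the repository's real version API provides.
interface Version {
    String getProperty(String name);                 // e.g. "is schema valid"
    List<Version> resolveDependencies(String type);  // e.g. "component of"
}

class RenditionPreflight {
    /** True only if the publication root and everything it uses by reference is schema valid. */
    static boolean readyToRender(Version publicationRoot) {
        if (!"true".equals(publicationRoot.getProperty("is schema valid"))) {
            return false;
        }
        for (Version component : publicationRoot.resolveDependencies("component of")) {
            if (!readyToRender(component)) {
                return false;
            }
        }
        return true;
    }
}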

Likewise, we can easily implement a Layer 3 management support feature that notifies authors or managers of invalid documents so they know they need to modify them to make them valid again. This is especially important if, as in this case, we might unilaterally cause documents to become invalid through no fault on the part of authors.

Anyway, back to the use cases.

I've stipulated that doc_01.xml represents a publication, that is, a unit of publication or delivery. This notion of "publication" is my private jargon but the need for it should be clear in the context of technical documentation and publishing. Most publishing business processes are driven by the creation of publications or "titles" or "doc numbers".

But there's nothing particular about the XML data that identifies it, generically, as the root of a publication and in fact there's no requirement that a publication be represented by a single root document (although that's a simple and obvious thing to do).

So we probably need some business-process-specific metadata to distinguish publication roots from non-publication roots. So let's define the metadata property "is publication root" with values "true" or "false". For doc_01.xml we set the value to "true" since it is a publication root.

Now we can do some interesting stuff--we can query the repository to find only those documents that are publication roots, which would be pretty useful. For example, it would allow us to narrow a full-text search to just a specific publication or just produce a list of all the publications. If we also have process-specific metadata for publications, such as its stage in the overall publication development and delivery workflow, we can see where different publications are. If we capture the date it was last published we can know if it's up to date relative to some other related publications. You get the idea.

So import a new document already!

Fine. Let's import doc_02.xml, which is the root of another publication. It conforms to the latest version of book.xsd. We have authored this document with knowledge of doc_01.xml and have created a navigation link from doc_02.xml to doc_01.xml, like so:

<p>See
<bibcite href="http://mycms/repository/resources/RES0001">Document One</bibcite>
for more information</p>

It also has a link to a Web resource:

<p>See
<bibcite href="http://www.docbook.org">docbook.org</bibcite>
for more information</p>


Note that the link to doc_01.xml uses a URL that points into our repository. But this is an absolute URL, which it needs to be as long as the document is outside the repository. It is a pointer to the resource for doc_01.xml which will be resolved by the repository into the latest version of that resource, which at the moment will be version VER0003, the second version of doc_01.xml.

This is the easiest case for import because it's unambiguous what the link points to in terms of data that is already in the repository. The address will still have to be rewritten on import but there's no question what it needs to be rewritten to and no question that the target is a version that is already in the repository.

By contrast, if we had instead authored the link as a reference to a local copy of doc_01.xml, the link might look something like this:

<p>See
<bibcite href="../doc_01/doc_01.xml">Document One</bibcite>
for more information</p>

In that case, the importer has to figure out, by whatever method, whether the file at "../doc_01/doc_01.xml" is in fact a version of a resource already in the repository and whether or not this local version of doc_01.xml is itself a new version that needs to be imported. Again, figuring this out cannot be automatic in the general case but depends on implementation-specific mechanisms.

This raises the question of how the link was authored in the first place. It's unlikely the author looked inside the repository to figure out that doc_01.xml is really RES0001 and then typed the correct URL. So the authoring tool must be integrated with the repository such that the author can request a list of potential reference targets, pick one, and have the most appropriate address put into the href= attribute value. So there must be some sort of integration API that the repository exposes that the authoring tool can use.

Note too that the details of the link authoring are completely schema-specific as are the policy rules for what can be linked to. In this case, let us assume that the "bibcite" (bibliographic citation) element can only link to entire publications. That's the easiest case because it only requires us to point to entire storage objects, which we can do with just our storage object repository and which is the easiest UI to create (a flat list of publications). Note also that by adding the "is publication root" metadata property we've enabled the creation of just this selection aid since now we can query the repository to get a list of publication root documents. For a complete implementation we'd probably want to capture things like document title and document number (if it's known) as storage object metadata just for convenience (we could always look inside the documents at the time we build the UI but that would be time consuming, easier to capture it on import or get it from somewhere else once).

If rather than pointing into the repository we pointed to a local working copy, then the UI could be as simple as just putting up a file chooser and letting the author figure out which file is a publication root (which they would probably know if you've imposed a consistent file organization scheme, such as putting each publication in its own directory under a common root directory). Or you could export sufficient metadata to enable the same UI as before or you could ask the repository, as above, but then, because the repository remembered where stuff was checked out, it knows what local URL to use.

Which approach is best depends a lot on how the data will be used or handled outside the repository. If your authors are always connected to the repository, there's no reason not to point to it directly. If your authors need to be able to work offline then you'll have to go the local working copy route. And of course you can support both modes of operation from a single repository because it's entirely a function of the importer and exporter logic.

In any case, on import the resulting URL for pointing to doc_01.xml is "/repository/resources/RES0001", that is, a relative URL (because everything's in the same storage space at this point). We also determine that we don't need to create a new version of doc_01.xml so we don't.

For the second bibcite, the one to the DocBook site, what happens? The importer could blindly look to see if it has anything for the DocBook site, see that it doesn't, and start importing all the HTML from the site as new resources and versions (XIRUSS-T includes a default HTML importer that will do this). But that's probably not what you want to do.

So the importer has to have some rules about what things are and are not ever going to be in the repository and external Web sites are probably never going to be in the repository. So on import the importer does not rewrite the href= to the DocBook site.

However, it could do something very interesting. It could create a new resource and version in the repository that acts as a proxy for the DocBook Web site. This would be useful because then we can create a dependency relationship between version doc_02.xml and the DocBook Web site without having to literally copy the Web site into our repository. This lets us manage knowledge of an important dependency using the same facilities we use for all our dependencies and gives us a way to capture and track important properties of the Web site using our local metadata facilities. The version is a version that is just a collection of properties with no data (although we could capture data if we wanted to, for example, to capture a cache of the state of the target Web site at the time we did the import).
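In the repository-state notation from Part 2, such a proxy might look something like this (the property names are just illustrative):

/repository/resources/RES0006 - name: "www.docbook.org"; initial version: VER0006
/repository/versions/VER0006 - name: "www.docbook.org"; Resource: RES0006
    original URL: http://www.docbook.org/
    is external proxy: true
    (no data content stored)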

So how has the repository changed following the import of doc_02.xml?

- Created new resource RES0005 and new version VER0005 for doc_02.xml.

- Set the "is publication root" and "is schema valid" properties to "true" for VER00005.

- Created new resource RES0006 and new version VER0006 for www.docbook.org.

- Created three new dependencies from VER0005 to the following resources:

- "governed by" dependency to RES0002 (the book.xsd schema)

- "document citation" to RES0001, reflecting the first bibcite link

- "document citation" to RES0006, reflecting the second bibcite link

Now let's do something useful with this data we've worked so hard to create: print it.

Let's say we have an XSLT script that converts book.xsd documents into XSL-FO for rendering into PDF. To apply this script to a publication we need only point our XSLT engine at the publication root version and style sheet and away it goes, e.g.:
c:> transform http://mycms/repository/versions/VER0005 book-to-fo.xsl > temp.fo

Because our repository acts as an HTTP server we don't have to do an export first.
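I've not shown the FO-to-PDF step; with Apache FOP, for example, it would be something along the lines of:
c:> fop -fo temp.fo -pdf doc_02.pdf
Any FO engine will do--the point is that the transform reads its input straight from the repository over HTTP.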

But there's an important question yet to answer: what does our XSLT script do with the various links?

In doc_02.xml we have two links to separate publications and those links need to be published in a form that will be useful in the published result. What does that mean?

In this case, we're publishing to PDF so we can presume that we want the links to be navigable links in the resulting PDF. Easy enough. But what will the links be to?

Hmmm.

In the case of the link to the DocBook Web site that's pretty easy: just copy the URL out as it was originally authored (or as constructed through the use of our proxy version, which will have had to remember the original URL or the normalized URL for the target Web site). No problem. Unless we use the proxy object, in which case either the XSLT has to know how to translate a reference to the proxy into a working URL, e.g., get the proxy object, get the appropriate metadata values, and go from there, or the repository has to provide a "getUrlForWebSite()" method that takes a Web site proxy object as input and returns the best URL to use for getting to the Web site itself. This type of function could be characterized as a "top of Layer 1" or "bottom of Layer 2" bit of functionality, in that it's generic but it's reflecting our locally-specialized version types. But in this case it's generic enough that it should probably be built into Layer 1. But since it deals with issues of link resolution and data processing it's arguably a Layer 2 functionality.
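For what it's worth, that method could be nearly trivial. Here's a Java sketch, with the Version interface standing in for whatever the repository's real object API looks like and the property names purely illustrative:

// Invented stand-in for the repository's version API--illustration only.
interface Version {
    String getProperty(String name);  // returns null if the property isn't set
}

class WebSiteProxyHelper {
    /** Return the best URL for reaching the Web site that a proxy version stands in for. */
    static String getUrlForWebSite(Version proxy) {
        String url = proxy.getProperty("normalized URL");
        if (url == null) {
            url = proxy.getProperty("original URL");
        }
        if (url == null) {
            throw new IllegalStateException("Not a Web-site proxy: no URL property set");
        }
        return url;
    }
}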

In any case, the Web site link is relatively easy.

But the link to publication doc_01.xml is a bit trickier: we almost certainly don't want the PDF to link to the original source XML, either as it resides in the repository or in some checked-out location. We want it to link to doc_01.xml as published. But what is that?

This is the tricky bit: if we haven't already published doc_01.xml then we either have to first publish it and then point to that result or we have to be able to predict in advance where it will be when published or we have to be prepared to post-process the published result (the PDF in this case) to rewrite the pointer to doc_01.xml as published at such time as we know where it is. And even then, if we move the PDFs around we may still need to rewrite the pointers.

This suggests that we need to be able to do pointer rewriting. For anything. But we already have a generic facility for that in our import/export framework. Happy day! All we have to do is implement code that knows how to do it for PDF and Bob's your uncle. [Have I had too much coffee this morning?]

This also suggests that the best place to publish to is the repository itself, because we can both easily serve the results from there and we can easily export them as needed, doing any pointer rewriting that might be necessary. We can also establish dependency relationships between the published results and their source data and capture any other useful metadata about the published artifacts. Because the core repository is generic there's no problem using it to store PDFs or anything else.

So we're starting to build up a set of components and repository features that together form a "Rendition Manager" that handles the generic aspects of publishing. This rendition manager needs to do the following:

- Get the input parameters for applying a given rendition process to a given version or set of versions.

- Provide the appropriate utility functions to rendition processors needed to get access to object metadata, resolve pointers, and so forth.

- Manage the import of newly-created rendition results back into the repository reflecting its knowledge of the inputs to the process. That is, while we can certainly have a generic PDF importer, we need a PDF importer that also knows that PDF doc_01.pdf was generated from version VER0003 of doc_01.xml and sets a dependency relationship reflecting that.

Some of this rendition manager can be built into the Layer 1 code, as discussed above (i.e., the API or protocol functions needed) but the management of the specific processors will be a Layer 3 component. That is, conceptually, the Rendition Manager is a client of Layers 1 and 2, in just the way an integrated authoring tool would be.

But you must have some form of Rendition Manager in order to do manageable publishing from the repository unless you do everything via bulk export and ad-hoc processes.

This is an important question to ask of any full-featured CMS provider: do you provide features and components that either comprise a rendition manager or make creating one easy?

Ok, so we run our rendition process and create a new PDF, doc_02.pdf, and bring it into the repository. The link to doc_01.pdf uses the URL "/repository/resources/RES0007". The link to the DocBook Web site uses the URL "/repository/resources/RES0008". In the repository we create the following new objects:

- Resource RES0009 and version VER0007 for doc_02.pdf. Its MIME-type property indicates that it is of type "application/pdf".

- Resource RES0007 (and no version) for doc_01.pdf. Surprised? This reflects the fact that we know that at some point in the future there will need to be a doc_01.pdf but we haven't created it yet. The resource object lets us link to it even though we haven't created any versions.

- Resource RES0008 and version VER0008 for the Web site www.docbook.org. The metadata would include the absolute URL of the Web site and anything else we can usefully glean from it.

- A dependency of type "rendered from" from VER0007 to resource RES0005 with policy "Version VER0006" indicating the exact version the PDF was created from.

- A dependency of type "navigates to" from VER0007 to resource RES0008, indicating the link to the docbook.org Web site

- A dependency of type "navigates to" from VER0007 to resource RES0001, indicating the link to doc_01.xml.

Why did we create the dependencies from the PDF document? We'll need these should we ever need to export a set of inter-linked PDFs to some delivery location, i.e., the external corporate Web site, an online review server, our local file system, whatever. We also need to know whether or not all the link dependencies are satisfied. We also may need to know if the workflow states of the source publications are those required in order to complete a publishing operation, which we can get by navigating from a given PDF to its publication source to see if it is, for example, in the "approved for publication" state, or if it exists at all.

For example, let's say that doc_01.xml version VER0003 is in fact in the "approved for publication" state, as is the latest version of doc_02.xml. If we try to do the "publish to corporate Web site" action (a Layer 3 process), we'll first chase down all the "navigates to" dependencies so we can get the PDFs of targets that are PDFs. We navigate to resource RES0007 and discover that it has no versions. With no versions we can't go on--we have no way of knowing, with the repository data we have, what publication might correspond to this PDF resource. Hmmm.

One way to address this would be to create a "rendition of" dependency from versions to the renditions generated from them. But those dependencies would be redundant with the equivalent links from the renditions to their source versions. In thinking about it, it makes more sense to create a resource-to-resource "rendition of" relationship.

This can be done with metadata on the resource object where the value is just a list of resources that are renditions of this resource. There's no need for indirection because we don't need to select a version, we just need to know that resource RES0007 (doc_01.pdf) is a rendition of resource RES0001. We need to know this because when we finally get around to rendering doc_01.xml we need to know what resource the PDF we create is a version of. The PDF-to-source dependency links will establish the version-to-version relationships.

Ok, so we do that such that resource RES0007 has the metadata property "rendition of" with the value "RES0001" (doc_01.xml).
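In the repository listing that looks something like:

/repository/resources/RES0007 - name: "doc_01.pdf"
    rendition of: RES0001
    versions: (none yet)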

Now when we go to do our publication, we resolve the navigates-to dependency from doc_02.pdf to resource RES0008, the DocBook.org Web site. We discover that this is a resource that is really outside the repository (by looking at the resource or version metadata, which I haven't shown). We see that it's a link to a Web site so we try to resolve the URL to make sure the Web site is at least still there. We can't really know if the Web site is still relevant without putting a human in the loop but we can at least catch the case where the Web site or specific Web resource is completely gone or unreachable.

Next, we resolve the navigates-to dependency from doc_02.pdf (VER0007) to resource RES0007 and see that it is a rendition of resource RES0001. We get the latest version, VER0003, and check its "approved for publication" status. It's "true" so we can continue. If it had been "false" we'd have to stop right there and report back that not all the dependencies are ready to be published externally.

But we still have the problem that there's no PDF for doc_01.xml. What do we do?

We could halt the process and report that somebody needs to render doc_01.xml or we could just do the rendering job ourselves as we know that all the prerequisites have been met. Let's do that. This creates new PDF doc_01.pdf, which we import into the repository just like we did doc_02.pdf, with all the same dependencies and properties and whatnot.

Now our requirement that all the local dependencies are satisfied is met. Everything's in the correct workflow state, so we now export the PDFs out in a form that can be placed on the corporate Web site. To do this we have to rewrite the URLs of the navigation links from pointers to PDFs inside the repository to pointers to PDF in what their locations will be on the corporate Web site.

This means that the exporter component has to know what the business rules are for putting things on the corporate Web site, either directly because the rules are coded into the software, or indirectly because, for example, the PDF version objects have metadata values that say what the location should be or must be.

Let's keep it simple and say that the PDFs are located relative to each other and in the same directory. This means we can rewrite the within-repository URL from "/repository/resources/RES0007" to "./doc_01.pdf".

Whew.

We've finally produced some usable output from our system. Time to go home and celebrate a job well done.

Let's review what we've done and seen:

- We've taken a system of two inter-linked publications through a cycle of authoring and revision and publication.

- We've created document-to-document hyperlinks using services provided by the Layer 1 storage manager coupled with Layer 3 customizations integrated into our authoring tool (had I made it clear that authoring tools are Layer 3 components? That should be obvious by now as, except for simple text editors, they're all about the semantics of your documents.).

- We enabled sophisticated workflow management reflecting local business rules and processes just by adding a few more metadata properties to our version, resource, and dependency objects.

- We created a Rendition Manager that can manage the creation of renditions from our documents such that the rendered results are themselves managed in the repository, which is a requirement in order to support processes such as publication to a corporate Web site or any operation that requires address rewriting on export.

- We created a Layer 3 component that manages the "publish to corporate Web site" action by using storage object metadata and dependencies to establish that all the necessary prerequisites are in place (workflow state, existence of renditions) or, if necessary, use the Rendition Manager to produce needed components (doc_01.pdf).

- We introduced resource-to-resource links using simple metadata values on resources to establish relationships between resources to support the case where a resource may be created in advance of having any versions.

- We made it clear that our repository can not only manage any kind of storage object but that it's essential that it do so in many cases. Thus we put our PDF renditions back into the repository from which they can be accessed directly for viewing or exported for delivery from other places.

- We saw the utility in creating "proxy" versions for things we don't own or control so that we can manage our dependencies and metadata on those resources within the repository, keeping all our processing closed over resources, versions, and dependency objects. Very important. You can do all sorts of really useful and clever things with these proxies, including mirroring resources managed in other physical repositories as though they were in yours. [Pinky to corner of mouth, low evil chuckle. Mischievous faraway glint in eyes. Absently pat head of Mini Me.]

This is pretty sophisticated stuff and is more than a lot of commercial systems do today (while at the same time they do stuff you don't want or need). And we've done it all with relatively simple software components that are connected together in clever ways. Because all the Layer 3 stuff we've invented for this use case can be built in isolation both from the Layer 1 repository and from each other, they can be individually as simple or sophisticated as needed or as you can afford. For example, the Rendition Manager could really just be a bunch of XSLT scripts or it could be a deeply-engineered body of Java code served through a full-scale Web server and designed to handle thousands of rendition requests an hour. But the minimum functionality of each of these components is pretty modest and no single component represents an unreasonable implementation difficulty--it's all very workaday programming: get an object, get a property value, chase it down, check a rule, get the target object, check its properties, apply a business rule, run a process, move some data, create a new resource, set some properties, blah blah blah is it lunch yet?

That is, the requirements might be broad but the implementation need not be deep and it certainly doesn't need to be monolithic or exclusive.

I know that at this point you've been given a lot to think about and if you've read this far in anything like one go your head is probably spinning. Mine is and I've written this stuff.

Hopefully I've succeeded in binding these general concepts and architectures to realistic use cases and processes that make it easier to see how they apply and where their power as enabling abstractions and implementation and design techniques really accrues.

Note too that we've only done document-to-document links--we haven't said anything about element-to-element links and what that might imply. That's actually because most of the inherent complexity is at the storage-object-to-storage-object level. Going to element-to-element linking really only complicates user interfaces and presents some potential scale and performance problems (because of the sheer potential volume of data to be captured and managed--there are typically orders of magnitude more elements than storage objects). But the fundamental issues of resolution and dependency tracking are the same so we'll see that we really don't have to do much more to our system to enable creation, use, and management of element-to-element links. We've done almost all the hard work already. And it wasn't really that hard.

[As an aside: I'm pretty happy with how this narrative is coming out even in this first draft. I fully intend to edit it together into a more accessible, coherent form as soon as I get it all out of my head, which shouldn't take too much longer. I hope.]

Next time: element-to-element linking (probably)


XCMTDMW: Import is Everything, Part 3

I hope I'm starting to get the point across that import, the act of crossing the boundary between outside and inside the repository, is where everything really happens. Because if I'm not making that point something is really wrong.

Before we continue exploring the import and access use cases started in Part 2, let's talk about schema-specificity for a moment, because I want to be careful I'm not painting too rosy a picture with all my talk about generic XML processing.

One issue with managing XML documents is the sensitivity of the management system to the details of the schemas. In the worst case the low-level repository schema directly reflects the schema such that any change to the document schema requires a change to the repository schema which, in the worst case, requires an export and re-import of all the data in the repository, which is a dangerous and disruptive thing to have to do.

That's clearly crazy and any system that has that implication is so inappropriately overoptimized that it makes me crazy to even think about it.

We've also seen that a completely generic system for importing XML, while useful, isn't nearly complete enough to support the needs of local business processes and business rules.

In yesterday's entry, Import is Everything, Part 2, we had just gotten to the point where we were creating, on import, some storage object metadata properties that were specific to our local policies, such as the "is schema valid" property, in the sense that we needed those properties in order to implement our policies and the business processes or user actions they implied. But those properties are still generic with respect to the document's schemas. An XML document is either schema valid or it isn't, regardless of the schema.

Because we were operating on just the XSD-defined links (schemaLocation=) some of our import processing was schema-specific but specific to a completely standard schema (XSD), not to our local schema.

But we're about to explore some use cases where we do need local schema awareness and we'll start to see where that awareness resides in the code. The short answer is, it resides in the import processing, top-of-Layer 2, and Layer 3 components. None of these should require a complete export and import of the documents involved should the schema change, although they might require reprocessing some or all of the documents in the repository (but directly from the repository).

It should be pretty clear by now that any extraction of metadata or recognition of dependency relationships that is schema-specific will of course happen in schema-specific import code. That's why the XIRUSS-T importer framework is designed the way it is, because you always have to write at least a little bit of code that is unique to your schemas and your business processes so why not make writing that code as easy as possible?

By "top-of-Layer 2" I mean code that does semantic processing of the elements inside the documents, such as link management, that sits on top of the generic facilities in Layer 2 but that may be schema specific, for whatever reason (usually optimization necessary to achieve appropriate scale or performance). For example, any full-text or element metadata index is a Layer 2 component. You can implement a completely generic, schema-independent indexing mechanism but for non-trivial document volumes and/or sophisticated schemas you will very likely want to tailor the index to both not index things you're unlikely to ever search for or to index things in a way that is more abstract than the raw XML syntax (I'll talk about these in more detail when I get around to full-text indexing of XML as a primary topic). To implement these specializations you'll need to tailor the indexer and possibly the UI for using the index in ways that are schema specific. No getting around it.

Likewise Layer 3 is where you implement functionality that reflects specific business processes and policies, which means processes that act on the XML in the repository as well as on the storage-object and element metadata in order to do useful stuff. Much of this functionality will be schema specific to one degree or another (but not all of it, of course).

So unless you can get by with a very generic system that only implements support for standards, you will always have to create and maintain system components that are schema specific. However, there are some important characteristics of a system architected as I've outlined here:

- The core storage object repository, Layer 1, is never schema specific. This means that the importers and Layers 2 and 3 can change without ever affecting the storage objects managed in Layer 1. In particular, you will never require an export and re-import if Layers 2 or 3 change.

- The code most sensitive to the schema details is closest to the edges of the repository and, in most cases, builds on more generic facilities. This has two advantages: the amount of code that is actually schema specific is minimized and the disruptive potential of changing that code is minimized.

- You get to choose, as a matter of policy and implementation, the degree of schema specificity that is appropriate for a given feature. You can choose whether your full-text index is generic or tailored, the degree to which you reflect the semantics of your link types in the dependency objects created from them, and so on. So you can start small and work up as both your understanding of your business processes improves and your schemas become more stable (assuming you're starting from scratch with brand-new schemas).

Regardless of how it's architected or implemented, most of the ongoing maintenance and operating cost of an XML-aware CMS comes from reacting to changes in the schemas of the documents managed. The only question is: does the CMS design and implementation minimize that cost or does it maximize it?

Also, when you start planning for the creation and deployment of an XML-aware CMS you need to define your overall requirements such that you can clearly distinguish those requirements that are schema-specific or schema-sensitive from those that are not. For example, a requirement to impose a basic workflow onto documents is probably not schema specific but a requirement to manage a particular kind of link that is not defined in terms of any standard is schema specific.

By separating the requirements in this way you can both better estimate the immediate and long-term costs of supporting those requirements and help the implementors keep the code that is schema independent more clearly separated from the code that is schema sensitive. This will go a long way toward making your system much less expensive to maintain in the long run and much more flexible in the face of new requirements, whether they are new business processes or new schema features.

Next: More linking and stuff


Thursday, July 27, 2006

XCMTDMW: Import is Everything, Part 2

At the end of part one we had successfully imported our system of two documents, our publication source document, doc_01.xml, and the XSD schema document that governs it, book.xsd. We created the dependency relationship between doc_01.xml version 1 and book.xsd and we captured as much object metadata as we could given what little we knew about the data at hand. This created a repository with the following state:
/repository/resources/RES0001  - name: "doc_01.xml"; initial version: VER0001
/repository/resources/RES0002 - name: "book.xsd"; initial version: VER0002
/versions/VER0001 - name: "doc_01.xml"; Resource: RES0001
dependency: DEP0001
namespaces: http://www.example.com/namespaces/book
root element type: "book"
mime type: application/xml
xml version: 1.1
encoding: UTF-8
/VER0002 - name: "book.xsd"; Resource: RES0002
root element type: "http://www.w3.org/2001/XMLSchema:schema"
namespaces: http://www.w3.org/2001/XMLSchema
target namespaces:
http://www.example.com/namespaces/book
mime type: application/xml
xml version: 1.0
encoding: UTF-16
/dependencies/DEP0001 - Target: RES0002; policy: "latest"
Dependency type: "governed by"

We saw that it was the importer that needed to have all the XML awareness.

Now we need to see what happens when we do something with our data. There are two interesting use cases at this point:

1. Creation of a new version of doc_01.xml

2. Creation of a new document governed by the same schema

For use case 1 let's say that by some mechanism, and it doesn't matter what, we end up with a new document outside the repository called doc_01.xml that is different in its data content from the doc_01.xml we imported as VER0001 into the repository. E.g., we checked VER0001 out of the repository, edited it, and now want to check it back in. Or we left the original doc_01.xml where it was on our file system, edited that copy, and now want to check it in. Or our editor accessed the bytes in VER0001 directly from the repository, let us edit them, and now wants to create a new version in the repository. It doesn't matter how we come to have the changed version of doc_01.xml, the import implications are more or less the same.

The first steps of the import process are the same:

1. Process the XML document semantically in order to discover any relationships it expresses via links in order to determine the members of the bounded object set we need to import.

2. Process the compound document children of the root storage object, i.e., book.xsd. We determine that book.xsd has no import or include relationships to any other XSD documents.

Assuming we haven't changed the schema reference or the "book" element's namespace, we get the same result BOS we did before: doc_01.xml and book.xsd.

3. For each member of the BOS, determine whether or not the repository already has a resource for which the BOS member should be a new version.

Now it gets interesting. First, we have to determine if our new doc_01.xml is really a version of resource RES0001. Remember: there's no general solution to this problem--you have to do something to either remember this information outside the repository or provide some heuristic for figuring it out when you need to or simply ask the user.

When I said above that it didn't matter how we came to have a new doc_01.xml that wasn't quite true because the way that we came to have it will likely determine how we know what version and resource it relates to in the repository.

If you use a check-out operation then you can capture the information about what version and resource you checked out, either as separate local metadata or embedded in the XML document (for example, as a processing instruction or attribute value). Putting the metadata in the document itself is safer because then you can't (easily) lose it but it limits you to managing XML data only (and it's not really safe because you can't prevent an author from modifying it if they really wanted to). Putting the metadata outside the document is more general but then requires a bit more work, either on the part of authors (they have to know where things are or should be on the file system) or in terms of some local data management facility to maintain the information. But this is the approach that CVS and Subversion use. It's simple and it works fine as long as users know what the limits are on their ability to do things like move files around.
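Just to make the embedded-metadata option concrete, here's a little sketch in Python. The <?checkout ...?> processing instruction and its resource/version attributes are entirely made up for illustration--they're not a XIRUSS-T convention or anybody else's:

import re

# Hypothetical PI written by the exporter just after the XML declaration,
# e.g.: <?checkout resource="RES0001" version="VER0001"?>
CHECKOUT_PI = re.compile(
    r'<\?checkout\s+resource="(?P<res>[^"]+)"\s+version="(?P<ver>[^"]+)"\s*\?>')

def read_checkout_metadata(path):
    """Return the (resource id, version id) recorded at check-out time, or None."""
    with open(path, "rb") as f:
        head = f.read(2048).decode("utf-8", errors="ignore")  # the PI lives near the top
    m = CHECKOUT_PI.search(head)
    return (m.group("res"), m.group("ver")) if m else None

The importer can then use that pair to decide which resource and version the local file relates to; if the PI is missing you fall back to separate local metadata, a heuristic, or asking the user.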

If you are accessing the bytes of a storage object directly via an editor then the editor can just remember where it got them from. This works as long as the editor doesn't crash or, if it does crash, has cached the metadata away somewhere.

But it can still happen that you just get a file from somewhere and whoever gave it to you tells you "this should be a new version of resource RES0001". For example, somebody might have made some changes offline and mailed you the file. In this case, you, the human, have to figure out what to do.

Note too that in the general case you can't depend on things like filenames. While we usually do as a matter of practice there's no magic to it. If you look at the repository listing above you'll notice that the resources and versions both have name properties. At least in the SnapCM model, these names are arbitrary and need not be unique in any scope beyond the object itself (and an object can have multiple names--they're just metadata values as far as SnapCM is concerned). The invariant, unique identifiers of the objects are the object IDs (RES0001, VER0002, etc.). For versions, the ultimate identifier is the resource they are a version of.

For example, say you like to reflect the version of a file in the filename itself, a common practice when people are not using an actual versioning system. You find you've got directories full of files like "presentation_v1.ppt" and "presentation_v2.ppt" and "presentation_final_wek.ppt". The filenames may only be coincidentally similar but you happen to know that they are all in fact different versions of the same resource, the presentation you were asked to write. In a repository like ours here you could import all these different versions and create them as versions of the same resource and they could keep their original names as their Name metadata value.

This is all to make the point that two storage objects are different versions of the same resource because we say they are and the general nature of the SnapCM model lets us say it however we want for whatever reason--there's no dependency on any particular storage organization or naming conventions or anything else. This means that you're free to apply the model to any particular way of organizing and naming things you happen to prefer. It also means that you can take any system of versioned information and recreate it exactly (in terms of the version-to-version and version-to-resource relationships) in a SnapCM repository.

Ok, back to our task. In this case we know that our local doc_01.xml is in fact a new version of resource RES0001.

Now we come to the schema, book.xsd. If we never exported it, meaning that we accessed it directly from the repository, then we will see that the pointer to it points back into the repository, that is, doc_01.xml as initially exported looks like this:
<?xml version="1.1"?>
<book xmlns="http://www.example.com/namespaces/book"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="/repository/resource/RES0002"
>
...
</book>
The importer can therefore know with certainty that we never created a new version (because versions inside the repository are invariant and cannot be changed) and therefore excludes it from the BOS to be imported. It's part of the BOS rooted at doc_01.xml, but since it's already in the repository we don't need to import it.

But if we had exported both doc_01.xml and book.xsd, such that doc_01.xml as exported looked like this:
<?xml version="1.1"?>
<book xmlns="http://www.example.com/namespaces/book"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="../schemas/book/book.xsd"
>
...
</book>
Then we've got a potential issue because we may or may not have modified the schema (possibly inadvertently if we, for example, opened it in an editor to see what its rules were and as a side effect saved it, changing even just some whitespace).

The importer must now determine if it really needs to import a new version of book.xsd or not and, if it does, should it create it as a new resource or as a new version of an existing resource. How can it make this determination?

First, it can look to see if there is already a schema in the repository that governs the namespace "http://www.example.com/namespaces/book". It can make this determination by doing a query like "find all latest versions with root element 'http://www.w3.org/2001/XMLSchema:schema' and with targetNamespace value 'http://www.example.com/namespaces/book'". If this returns any versions then you know that you have at least one resource related to the target namespace that is an XSD schema (and not, for example, a RelaxNG schema).
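Here's a minimal sketch of that query in Python, against a purely hypothetical in-memory repository API (latest_versions() and the metadata dict are my inventions; the property names just mirror the repository listings in this post):

XSD_ROOT = "http://www.w3.org/2001/XMLSchema:schema"

def find_governing_schemas(repository, namespace):
    """Return the latest versions that are XSD schema documents targeting `namespace`."""
    hits = []
    for version in repository.latest_versions():   # assumed API
        md = version.metadata                        # assumed: dict of version properties
        if (md.get("root element type") == XSD_ROOT
                and namespace in md.get("target namespaces", [])):
            hits.append(version)
    return hits

Zero hits means book.xsd is a brand-new resource; exactly one hit gives you the candidate resource for a new version; more than one hit is the problem discussed next.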

If you get back more than one resource then you have a problem: either something screwed up on a prior import and created two resources for what should have been one resource or you have two truly different XSD documents that both target the same namespace. Now you have to figure out which one is the correct one to use before you can even decide whether or not to create a new version of it. How do you decide?

I find this to be a tough question. The challenge here is partly a function of the details of XSD in that you can choose to organize an XSD schema into multiple XML documents, all of which may name the same target namespace. But only one of them is the real root of the compound document, that is, only one of them can actually be used as the starting point for validating documents.

You might also have different variants of the same base schema for different purposes. For example, I have this case where I have one variant for publishing that defines global key constraints and another variant for authoring that does not, because for authoring the documents will be organized into many separate XML documents and XSD provides no way to constrain or validate cross-document references.

One way to handle this would be to use version metadata to indicate explicitly which of your XSD documents are schema roots and which are not. Another way would be to put that inside the schema as an attribute on the schema element or a subelement in your own namespace or whatever. And of course you could do both, with your XSD importer using the embedded metadata to set the storage object metadata.

But you should start to see that this is the first place where we are forced to integrate the repository with our local and non-standard business rules and that the knowledge and implementation of those business rules is in...wait for it...the importer.

It should also be clear at this point that no out-of-the-box XML-aware importer is going to do the thing you need except by accident or if you modified your policies in order to fit what the tool does. If the tool you choose happens to match what you already do or what you're happy to do, then great, you chose well, buy the engineers who built it a beer and go on your way. But if it doesn't....

Another approach would be to limit yourself to having exactly one XSD document per governed namespace. This is the easiest solution and a lot of times you can do it but it's not realistic as a general practice for the reasons given above.

OK, so schemas (and not just XSD schemas, any form of schema) complicate things.

So where were we? Oh yeah, is our book.xsd a new resource, a new version of an existing resource, or already in the repository?

In our current example there is only one version that governs the namespace so we only need to determine if we need to import our local copy. Here we have to look to see if it's been modified locally. If the local copy has not been modified, which we can know if we captured the time it was checked out at (this is what CVS does) and compare that to the last-modified time stamp on the file, then we know we don't need to import it. If it has been modified, or at least the timestamp has changed or if we didn't capture that (somebody just sent us a bunch of files and said load these up), then our only choice is to do some form of diff against the version in the repository.

We could just do a simple byte compare, which is easy to implement but for XML we might want to be more sophisticated and use an XML-aware diffing engine so we don't commit new versions that differ only in things like whitespace within markup. Again, this is a function of the importer and you, the importer implementor, get to choose how sophisticated you make it. For something simple like XIRUSS-T you can expect at most a simple byte-level diff. For a commercial system that claims XML awareness you should expect some sort of XML differencing that you can configure. Or you might just have to figure it out yourself by asking somebody or looking at the files or guessing.
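For what it's worth, here's a sketch in Python of the "do we really need a new version?" decision. The byte compare is exact; the "XML-aware" compare shown here is only a crude stand-in based on W3C canonicalization (it papers over things like attribute order and whitespace inside start tags, but it is nowhere near a real XML diff engine):

import io
import xml.etree.ElementTree as ET

def same_bytes(local_path, repo_bytes):
    with open(local_path, "rb") as f:
        return f.read() == repo_bytes

def same_xml_roughly(local_path, repo_bytes):
    # Compare canonical (C14N) serializations instead of raw bytes.
    local = ET.canonicalize(from_file=local_path)
    stored = ET.canonicalize(from_file=io.BytesIO(repo_bytes))
    return local == stored

def needs_new_version(local_path, repo_bytes, xml_aware=True):
    if same_bytes(local_path, repo_bytes):
        return False
    if xml_aware:
        return not same_xml_roughly(local_path, repo_bytes)
    return True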

OK, in our case we do a simple byte compare and determine that the file we have locally and the one in the repository are identical, so no need to create a new version.

3.1 In temporary storage (or in the process of streaming the input bytes into the newly-created version objects) rewrite all pointers to reflect the locations of the target resources or versions as they will be within the repository.

This is just like the last time.

3.2 For each BOS member, identify the relevant metadata items and create each one as a metadata item on the appropriate newly-created repository object.

Ditto

4. Having constructed our storage-object-to-version map (this time it maps our local doc_01.xml to version VER0001), we execute the import process. In this case, we will construct the following new objects in the repository:

- Version object VER0003, the next version of VER0001 (and by implication, a version of resource RES0001)

- Dependency object DEP0002 from version VER0003 to resource RES0002, reflecting the governed-by relationship between doc_01.xml and book.xsd.

The new state of the repository is:
/repository/resources/RES0001  - name: "doc_01.xml"; initial version: VER0001
/repository/resources/RES0002 - name: "book.xsd"; initial version: VER0002
/versions/VER0001 - name: "doc_01.xml"; Resource: RES0001
prev_versions: {none}
next_versions: VER0003
dependency: DEP0001
namespaces: http://www.example.com/namespaces/book
root element type: "book"
mime type: application/xml
xml version: 1.1
encoding: UTF-8
/VER0002 - name: "book.xsd"; Resource: RES0002
prev_versions: {none}
next_versions: {none}
root element type:
"http://www.w3.org/2001/XMLSchema:schema"
namespaces: http://www.w3.org/2001/XMLSchema
target namespace:
http://www.example.com/namespaces/book
mime type: application/xml
xml version: 1.0
encoding: UTF-16
/VER0003 - name: "doc_01.xml"; Resource: RES0001
prev_versions: VER0001
next_versions: {none}
dependency: DEP0002
namespaces: http://www.example.com/namespaces/book
root element type: "book"
mime type: application/xml
xml version: 1.1
encoding: UTF-8
/dependencies/DEP0001 - Target: RES0002; policy: "latest"
Dependency type: "governed by"
/dependencies/DEP0002 - Target: RES0002; policy: "latest"
Dependency type: "governed by"

Notice a few new things in this listing:

- I've added the prev/next version pointers to the versions. In SnapCM, each version can have more than one previous or next version where different versions are organized into different "branches", which I haven't talked about yet (our current repository is a repository with exactly one branch, if you want to be precise about it).

- There are two dependency objects which appear to be identical by the metadata shown. However, each dependency is owned by the version that uses it (it's really an exclusive property of the version) and its metadata is not invariant. In particular, you are likely to want to change the resolution policy for a given version as the state of the repository changes, as we'll see in a moment. Of course, a real implementation could transparently normalize the dependency objects so it only maintained instances that actually varied in their properties, creating new instances as necessary. But that's an optimization we don't need to worry about here. [You may be starting to see the method in my madness: if I can think of a way it could be optimized I don't worry about reflecting that optimization in the abstract model, because I'm confident that if that optimization is needed it can be added to the implementation.]

- Except for maybe doing a diff on import, we've said nothing about the data content of the versions. That's because, for most purposes the data content is really secondary and arbitrary. There's nothing about the functioning of the repository itself (as opposed to the importer, which is all about the data) that has any direct knowledge of or dependency on the data inside the storage objects. You can think of the repository as a Swiss bank: it doesn't know and it doesn't want to know. Knowing is somebody else's job. By the same token, there are lots of types of versions that are only collections of simple metadata values and are not storage objects at all.

OK, so now we've successfully committed a new version of doc_01.xml into the repository, and we correctly did not create an unnecessary new version of the schema. We did a good day's work; let's go home.

OK, not so fast.

We discovered that our schema is not complete with respect to our requirements and we have to add a couple of new element types or some attributes or whatever. The point is we have to modify it. We also discover that one of our existing content models is wrong wrong wrong and that we have to change it in a way that will make existing documents invalid. Doh!

So we check out version VER0002 to create a local copy of book.xsd. We edit it to change the content model, and go to commit it back to the repository.

But wait--if we do that, what will happen?

By default, all the dependency links from documents to their governing schemas use the "latest" policy. If we commit a new version we will effectively break those documents even though they are, today, valid against the current latest version of book.xsd in the repository. What do we do?

This is a matter of policy. You could choose to invalidate all the documents and require that they all be edited to make them valid. Sometimes that's the right thing to do based on whatever your local requirements are.

Or you could do this:

1. Find all the dependencies that point to schema book.xsd: "find all dependency objects of type 'governed by' that point to resource RES0002"

2. For each dependency, change its resolution policy from "latest" to "Version VER0002".

This changes the dependencies from being dynamic, resolution-time pointers to hardened version-specific pointers. Notice too that we didn't do anything to the versions involved.

Now, let's refine this operation a little bit by saying that, as a matter of our policy, we want to harden the links to schemas for all versions that are not the latest version of their resource. That is, we don't want to break any old versions but we do want to break the latest so that we know we have to fix it.

That means that for dependency DEP0001 we will change the policy to "Version VER0002" but for DEP0002 we will not. In addition, we will add a metadata value to the latest versions to indicate that we know they are not (or soon will not be) valid against their schema [I know I said that version metadata is invariant but actually some is and some isn't depending on the semantics of the metadata {or you can imagine that we created a new version to reflect the new metadata, updated the repository to reflect it, and went on--since I have to type the repository state by hand, let's just say we can change version metadata}].
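Here's what that refined pass might look like in code--a Python sketch against a hypothetical repository API (find_dependencies(), is_latest(), and friends are my inventions, not XIRUSS-T calls):

def harden_schema_dependencies(repository, schema_resource_id):
    # The version the existing documents were actually validated against
    # (VER0002 in our example).
    pinned = repository.latest_version(schema_resource_id)      # assumed API
    for dep in repository.find_dependencies(target=schema_resource_id,
                                            dep_type="governed by"):
        owner = dep.owning_version()
        if repository.is_latest(owner):
            # Leave the latest version on the "latest" policy so it breaks
            # loudly when the new schema arrives, and flag it for repair.
            owner.set_metadata("is schema valid", False)
        else:
            # Pin older versions to the schema version they were valid against.
            dep.set_policy("Version " + pinned.id)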

The new state of the repository is:
/repository/resources/RES0001  - name: "doc_01.xml"; initial version: VER0001
/repository/resources/RES0002 - name: "book.xsd"; initial version: VER0002
/versions/VER0001 - name: "doc_01.xml"; Resource: RES0001
prev_versions: {none}
next_versions: VER0003
dependency: DEP0001
namespaces: http://www.example.com/namespaces/book
root element type: "book"
mime type: application/xml
xml version: 1.1
encoding: UTF-8
is schema valid: true
/VER0002 - name: "book.xsd"; Resource: RES0002
prev_versions: {none}
next_versions: {none}
root element type:
"http://www.w3.org/2001/XMLSchema:schema"
namespaces: http://www.w3.org/2001/XMLSchema
target namespace:
http://www.example.com/namespaces/book
mime type: application/xml
xml version: 1.0
encoding: UTF-16
/VER0003 - name: "doc_01.xml"; Resource: RES0001
prev_versions: VER0001
next_versions: {none}
dependency: DEP0002
namespaces: http://www.example.com/namespaces/book
root element type: "book"
mime type: application/xml
xml version: 1.1
encoding: UTF-8
is schema valid: false
/dependencies/DEP0001 - Target: RES0002; policy: "Version VER0002"
Dependency type: "governed by"
/dependencies/DEP0002 - Target: RES0002; policy: "latest"
Dependency type: "governed by"

Let's think about what we've done:

- We've used the indirection of the dependency links to change or preserve the processing result of the XML documents even though we didn't change the documents themselves. For the old version of doc_01.xml we preserved our ability to process it as a valid document by explicitly binding it to the specific version of book.xsd (VER0002) against which it was validated. For the new version of doc_01.xml we made the conscious choice to allow it to become invalid when we commit the new version of book.xsd.

- We added a new metadata value, "is schema valid", that allows us to capture information about the documents that reflects some aspect of their processing. In this case we're setting it by hand because we know the schema change we're about to commit will make the document invalid, but you could imagine that we have a process that gets every latest XML document that is not a schema, validates it against its schema, and records the result in the "is schema valid" property. This could then drive a Layer 3 workflow application that every morning sends a report listing all the XML documents that are not valid. Or we could do a validation on import and indicate the result there. Whatever. The point is we've added more metadata that is specific to our business processes and policies.

Now that we've made the repository safe for a new schema version, we import our updated book.xsd document using the same process as before. The new state of the repository is:
/repository/resources/RES0001  - name: "doc_01.xml"; initial version: VER0001
/repository/resources/RES0002 - name: "book.xsd"; initial version: VER0002
/versions/VER0001 - name: "doc_01.xml"; Resource: RES0001
prev_versions: {none}
next_versions: VER0003
dependency: DEP0001
namespaces: http://www.example.com/namespaces/book
root element type: "book"
mime type: application/xml
xml version: 1.1
encoding: UTF-8
is schema valid: true
/VER0002 - name: "book.xsd"; Resource: RES0002
prev_versions: {none}
next_versions: VER0004
root element type:
"http://www.w3.org/2001/XMLSchema:schema"
namespaces: http://www.w3.org/2001/XMLSchema
target namespace:
http://www.example.com/namespaces/book
mime type: application/xml
xml version: 1.0
encoding: UTF-16
/VER0003 - name: "doc_01.xml"; Resource: RES0001
prev_versions: VER0001
next_versions: {none}
dependency: DEP0002
namespaces: http://www.example.com/namespaces/book
root element type: "book"
mime type: application/xml
xml version: 1.1
encoding: UTF-8
is schema valid: false
/VER0004 - name: "book.xsd"; Resource: RES0002
prev_versions: VER0002
next_versions: {none}
root element type:
"http://www.w3.org/2001/XMLSchema:schema"
namespaces: http://www.w3.org/2001/XMLSchema
target namespace:
http://www.example.com/namespaces/book
mime type: application/xml
xml version: 1.0
encoding: UTF-16
/dependencies/DEP0001 - Target: RES0002; policy: "Version VER0002"
Dependency type: "governed by"
/dependencies/DEP0002 - Target: RES0002; policy: "latest"
Dependency type: "governed by"

Now we're starting to get some interesting stuff in the repository.

We have cross-document links (the links from the doc_01.xml documents to their schemas), we have version-aware link resolution, via the dependencies, we have both generic and business-process-specific metadata, and we have some sequences of versions in time.

We can also see that the repository itself stays remarkably simple--what you see here is not that far from what a fully-populated set of properties and objects would look like (as you can see if you run the XIRUSS-T application). You can also see that the repository state could easily be represented using a direct XML representation for export, archiving, or interchange (the storage object data streams could be held in the same XML or as separate storage objects on the file system).

But we've done some pretty sophisticated stuff, what with intelligent handling of schema versions and managing our links using indirect, version-aware, policy-based pointers. How did we do it? We did it in the importer (and to a lesser degree, in the exporter), where all the complexity lies, because that's where the specific knowledge of the data formats, their semantics, and our local business objects, processes, and policies lives.

Let's talk about exporters for a minute.

I haven't said much about exporters because most of the complexity is in the importers--that's where you have to do all the initial syntactic and semantic processing to get the stuff into the repository. Getting it out is usually much easier.

In the best case there is no export at all: you access all storage objects directly from the repository without first copying them out to your local file system.

But in reality you will always need to do some exporting, if only for long-term, repository-independent archiving of your data (you do do that, right?).

For export, the main concern is rewriting of pointers on export so that the pointers point to the appropriate version of the correct resource in the correct location. As we saw above, this varies from doing nothing (if you are accessing the target object from the repository using the current resolution policy) to setting it to a relative URL that reflects where the target was copied to locally.

In addition, depending on how you manage the local file-to-version metadata on export, the exporter needs to set that metadata. Essentially, the exporter needs to have in its head a mapping from versions in the repository to their eventual locations, as exported, so it can then rewrite any pointers that need rewriting. This map is either explicit because the exporter creates it as it does the exporting or it's implicit in some file organization convention, the most obvious of which is that the export structure matches the directory (or folder or cabinet or whatever) structure in the repository.
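As a sketch of what that map and rewrite might look like (in Python; the layout convention and the version objects' id/name properties are assumptions, not a description of any particular exporter):

import os

def plan_export(versions, export_dir):
    """Map version id -> local file path, here simply mirroring the version names."""
    return {v.id: os.path.join(export_dir, v.name) for v in versions}

def rewrite_pointer(source_version_id, target_version_id, location_map):
    """Compute the relative URL from the exported source file to the exported target."""
    source = location_map[source_version_id]
    target = location_map[target_version_id]
    rel = os.path.relpath(target, start=os.path.dirname(source))
    return rel.replace(os.sep, "/")   # URLs always use forward slashes

The exporter walks each exported storage object and, for each pointer it finds (using the same pointer-location knowledge the importer has), replaces the within-repository URL with the result of rewrite_pointer().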

Of course there's more that an exporter could do, such as creating Zip or tar packages of the exported files, loading the results into another repository, or whatever.

So exporters also have to be smart and they will also have some knowledge of the data formats to be exported (so they can, at a minimum, rewrite pointers) and local business rules and policies, but they are still much simpler than the corresponding importers and much of their work is probably already supported by facilities needed by the importer (such as XML attribute rewriting).

But we've now seen one complete cycle of the create-modify-create-new-version process, and once you can do one cycle you can do a million.

We still need to look a bit more closely at the implications of resolution of links via dependency objects. We also need to look at more linking cases, both for import and for processing. Finally, we need to look at the requirements and implications for rendition processing (that is, processing compound documents to produce a deliverable publication, such as PDF or HTML pages).

Next time: Linking and addressing with versioned hyperdocuments


Wednesday, July 26, 2006

XCMTDMW: Has it Really Been 11 Years?

A slight aside: I was poking around on the internal Innodata Isogen sales and marketing support portal and stumbled on an archive of all the old papers that various ISOGENers have written over the years, including some of mine. One I stumbled on was one I wrote and presented in 1995 titled "SGML Document Management". You can find an HTML version of it here: http://www.oasis-open.org/cover/kimber-sgmldocm.html (thanks Robin).

Even though my understanding and ideas have refined and evolved over the years, it's remarkably consistent with what I've been saying in this thread. I would now replace the focus on entities with a focus on link-based re-use but the overall architecture is very much the same.

Another interesting historical footnote is the paper I wrote with Dr. Steve Newcomb and Peter Newcomb on "Referent Tracking Documents": http://www.coolheads.com/SRNPUBS/ref-track-docs-paper.pdf.

This paper describes a technique for using simple storage object versioning and straightforward link markup to represent links in a manageable way. The key to this was that it provided, in the simplest possible way, a standards-based approach to capturing and managing complex element-to-element linking information. Of course we never expected that you would literally implement a system using huge collections of little documents but you could if you wanted to and it would work. It would just be really slow (or maybe not so slow--parsing XML is pretty fast and the files are small). At a minimum it provided a standards-based interchange representation for an arbitrarily complex link index. I've never actually tried to implement a system that used this approach literally, although the Bonnell and XIRUSS systems both reflect the ideas, just not expressed as literal XML structures (but they could be, using exactly these techniques).

In any case, those ideas are now reflected more abstractly in the SnapCM model but the basic concept is the same, in particular the approach of using a reference to a resource (in the SnapCM sense) plus a resolution policy to address a specific version or versions. That paper was given in 1999.


XCMTDMW: Import is Everything

We've talked a lot about what an XML-aware CMS should look like and what it needs to do. Now it's time to put something into it.

So first a little map of the area we're about to explore. Where we are is a border region, the boundary between where your XML documents are now, the "outside world", and where we want them to be, the "repository". Separating these two is a high ridge of mountains that can only be crossed with the aid of experienced guides and, depending on the cargo you're carrying, more or less sophisticated transport. [Or, on a bad day, some sort of demilitarized zone fraught with hidden dangers and mine fields on all sides.]

If you're just bringing in files containing simple or opaque data with little useful internal structure or references to other files, a simple mule train will do the job. But if you're bringing in interconnected systems of files containing sophisticated data structures you're going to need the full logistical muscle of a FedEx or UPS, who can offer a range of services as part of their larger transportation operation.

The point I guess I'm trying to make is that as soon as you go from files that are individual isolated islands of data to files that connect to each other in important ways, you're going from simple to dangerously complex.

Most, if not all, data formats used for technical documentation use or can use interconnected files to create sophisticated systems of files. The most obvious case is documents that use graphics by reference or point to style sheets or that have navigation links to another document. Even PDFs, which we tend to think of as atomic units of document delivery, can have navigable links to other PDFs (or to anything else you can point a URI at).

So any repository import mechanism needs to be able to work with systems of files as systems of files, however those systems might be expressed in the data. Even if you aren't doing any semantic management but only storage object management, it is still useful, for example, to be able to import all of the files involved in a single publication as a single atomic action.

I want to stress here that while XML as a data format standardizes and enables a number of ways to create systems of files, it is not in any way unique in creating systems of files to represent publications.

This suggests that a generalized content management system must have generic features for both representing the connections between files and using and capturing those connections on import. We've already established that the storage management layer (Layer 1 in my three-layer model) should provide a generic storage-object-to-storage-object dependency facility. It follows that our import facilities should provide some sort of generic dependency handling facility.

At this point I want to define a few terms that I will use in the rest of this discussion:

- publication. A single unit of publishing, as distinct from the myriad data objects that make up the publication. This would normally translate to "doc number" or "title" but in any case it is the smallest unit of data that is published as an atomic unit for delivery and consumption. It is usually the largest or next to largest unit of management in a publication workflow in that you're normally managing the creation of publications for the purpose of publishing them atomically at specific times. That is, while some information is published piecemeal as topics that are dynamically organized, the typical case is that you're publishing books on paper or as single PDFs. That book or that PDF is the "publication". Thus a "publication" is a business object that can be clearly distinguished from all other publications, i.e., by its ISBN or doc catalog part number or whatever. While it is not required, it is often the case that publications are represented physically by the top-level or "master" file of their source data (in DITA terms, by a map or bookmap).

- compound document. A system of storage objects with a single root storage object linked together using some form of semantic link (i.e., XIncludes, topicrefs, conrefs, or whatever) in order to establish the direct content of a publication or similar unit of information organization or delivery. What exactly constitutes the members of a compound document is a matter of specific policy and document type semantics. For example, if you have both XIncludes and navigation links among several XML documents you would normally consider only the XIncludes for the purpose of defining the members of the compound document.

- resource. An object that represents and provides access to all the versions of a single logical object. For example, if you import a file for the first time, that creates both a resource and a version, which points to the resource. The resource represents the file as a set of versions. If you then import a second version of the same file, it would point to the first version, from which you could then navigate to the resource. Resources are objects with unique identifiers within the repository. From a resource you can get to any of its versions. Therefore the resource acts as a representation of the file independent of any of its versions. Resources are vitally important because they are the targets of dependency relationships held in the storage management layer.

- version. An invariant collection of metadata and, optionally, data, related to exactly one resource and to zero or more previous or next versions of the same resource. When you import files into the repository you are creating new versions. Once created, versions do not change (you could, for example, implement your repository using write-once storage). The only possible exception to their invariance is version destruction--there are some use cases where it is necessary to be able to physically and irrevocably destroy versions (for example, document destruction rules for nuclear power plants in the U.S. or removal of draft bills from a legislative bill drafting system).

- repository. A bounded system that manages a set of resources, their versions, and the dependencies between versions and resources.

- storage object. A version that contains data as a set of bytes. A storage object provides methods for accessing its bytes.

- dependency. A typed relationship between a specific version and a resource reflecting a dependency of some sort between the version and the resource. The pointer to the resource includes a "resolution policy" which defines how to choose a specific version or versions of the resource. The default policy is "latest". Therefore, by default, a version-to-resource dependency is a link between a version and the latest visible version of the target resource. Dependency policies can also specify specific versions or more complex rules, such as rules that examine metadata values, storage object content, the phases of the moon, the user's heart rate, or whatever.

All of these terms except "publication" are from the SnapCM model http://www.innodata-isogen.com/knowledge_center/white_papers/snap_cm.pdf.

- bounded object set (BOS). The set of storage objects that are mutually dependent on each other, directly or indirectly. A compound document reflecting only XInclude links would form one BOS. If you also reflected any cross-storage-object navigation links you would get a different (larger) BOS. BOSes are useful for defining units of import and export as atomic actions. A BOS is "bounded" in that it is finite. When constructing a BOS that includes navigation links you may need to define rules that let you stop including things at some point, otherwise you might attempt to include the entire Web in your BOS, which is, for all practical purposes, unbounded. It is a set in that a given storage object occurs exactly once in the BOS, regardless of how many times it might be linked from various BOS members. The creation of a BOS requires that you be able to determine the identity of storage objects in some way, distinct from the mechanism by which they were addressed. That is, given two different URIs that you can resolve, you need to be able to determine that the resulting resources (in HTTP terms) are in fact the same resource. All file systems should provide this ability but not all storage systems can do this.

This is almost all there is to the SnapCM model. There's a bit more that I'll introduce as we need it. I should also point out that you should be able to map the SnapCM abstract model more or less directly to any existing versioning system. For example, with Subversion, there is a very direct correspondence between the SnapCM version, resource, and repository objects and Subversion constructs.

Therefore SnapCM can be valuable simply as a way to think about the basic operations and characteristics of systems of versions separated from distracting details of implementation. That thinking can then be applied to specific implementation approaches or existing systems. For example, you might have some crufty old content management system built up over the years with lots of specialized features, no clear code organization or component boundaries, and so on. By mapping what that system does to the SnapCM model you might be able to get a clearer picture of what your system does in order, for example, to separate, if only in your mind, those features that are really core content management features and those features that are business-object and business-logic specific (import, export, metadata reporting, UIs, etc.).

For the rest of this discussion I will only talk about XML compound documents, because that's our primary focus and they are the clearest case. But I want to stress that the basic challenges of import apply to any form of non-trivial documentation representation, proprietary or standard, and the basic solutions are the same. A system built to handle XML compound documents should be able to be quickly adapted to managing FrameMaker documents just by adding a bit of Frame-specific import functionality. Note my stress on "quickly".

Let's start small, just a single XML document instance governed by an XSD schema. Let us call it "doc_01.xml". We want to import it into the repository. This is the simplest possible case for our purposes as we can assume that you will not be authoring documentation for which you do not have a governing schema. There are other XML use cases in which schemas are not needed or are not relevant. This is not one of them.

So right away we have a system of at least two documents: the XML document instance and the XSD document that governs it. To import this system of documents I have to do the following:

1. Process the XML document semantically in order to discover any relationships it expresses via links in order to determine the members of the bounded object set we need to import. We have to import at least the minimum required BOS so that the state of the repository after import, with respect to the semantics of the links involved in the imported data, is internally consistent. That is, if DocA has a dependency on DocB that if not resolved prevents correct processing of DocA, then if you only import DocA and not DocB, the internal state of the repository will be inconsistent. Therefore you must import DocA and DocB as an atomic action in order to ensure repository consistency.

In this case we discover that doc_01.xml uses an xsi:schemaLocation= attribute to point to "book.xsd". This establishes a dependency from doc_01.xml to book.xsd of the type "governed by" (the inverse relationship, "governs", while interesting, is not a dependency because a schema is not dependent on the documents it governs).

We don't find any other relevant links in doc_01.xml.

At this point, we have established that doc_01.xml is the root storage object of our compound document and the first member of the BOS to be imported. We know that book.xsd is rooted (for this compound document) at doc_01.xml and will be the second member of our BOS.

2. Process the compound document children of the root storage object, i.e., book.xsd. We determine that book.xsd has no import or include relationships to any other XSD documents (if it did we would of course add them to our BOS).

At this point we have established a BOS of two members reflecting a compound document of two storage objects.
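If you want to picture the BOS-construction step as code, here's a tiny Python sketch. The find_dependencies() function stands in for the schema-aware code that knows which attributes (xsi:schemaLocation=, xi:include/@href, and so on) are really pointers--its interface here is my invention:

import os
from collections import deque

def build_bos(root_path, find_dependencies):
    """Return {canonical path: [(dependency type, canonical target path), ...]}.

    find_dependencies(path) must yield (dependency type, absolute target path)
    pairs, resolving any relative references against `path` itself.
    """
    bos = {}
    queue = deque([os.path.realpath(root_path)])
    while queue:
        member = queue.popleft()
        if member in bos:            # a BOS is a set: each storage object appears once
            continue
        deps = list(find_dependencies(member))
        bos[member] = deps
        for _dep_type, target in deps:
            queue.append(os.path.realpath(target))
    return bos

For our example this returns two members: doc_01.xml (with one "governed by" dependency) and book.xsd (with none). Note the use of canonical paths to establish storage object identity, which is the identity requirement called out in the BOS definition above.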

3. For each member of the BOS, determine whether or not the repository already has a resource for which the BOS member should be a new version.

Hold the phone! How can I possibly know, in the general case, whether a given file is already represented in the repository?

The answer is: you can't. There is no general way to get this knowledge. There are a thousand ways you could do it.

One approach would be to use a CVS- or Subversion- style convention of creating local metadata (the "working copy") that correlates files on the file system to resources and versions in the repository. This is a perfectly good approach.

Another approach would be to use some sort of data matching heuristic to see if there are any versions in the repository that are a close match to what you're trying to import. There are systems that do something like this (I know some element-decomposition systems will normalize out elements with identical attributes and PCDATA content).

You can use filenames and organization to assert or imply correspondence (if a file with name X is in directory Y on the file system and in the repository then they're probably versions of the same resource). Of course this presumes that the repository's organizational facilities include something like directories. Not all do.

Another approach is to require the user to figure it out and tell the importer.

This last approach is the only really generalizable solution but it's not automatic. In the XIRUSS-T system I've generalized this in the import framework through the generic "storage-object-to-version map", which defines an explicit mapping between storage objects to be imported and the versions of which they are to be the next version, if any. How this map gets created is still use-case-specific. It could be via an automatic process using CVS-like local metadata, it could be heuristic, it could be via a user interface that the importing human has to fill out. But regardless you have to have some way to say explicitly at import time what existing versions the things you are importing are related to.
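The map itself is trivial--it's the filling in of it that is use-case-specific. A Python sketch (resolve_prior_version() is a placeholder for whichever mechanism you use: local metadata, a heuristic, or a dialog box the user fills out):

def build_import_map(bos_members, resolve_prior_version):
    """Map each local storage object to the version it should succeed, or None if new."""
    return {path: resolve_prior_version(path) for path in bos_members}

For this first import into an empty repository every entry comes back None; for the re-import of the edited doc_01.xml walked through in Part 2, doc_01.xml maps to VER0001.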

OK, for this first import scenario we establish that in fact the repository is empty so there's no question that we will be creating new resources and versions for both doc_01.xml and book.xsd.

4. Having constructed our empty storage-object-to-version map, we execute the import process, the result of which is that we create two new resources, one for doc_01.xml and one for book.xsd, and for each resource, the corresponding version, being storage objects holding the sequence of bytes from doc_01.xml and book.xsd respectively. We also create a dependency instance from the version of doc_01.xml (let us call this doc_01.xml version 1) to the resource for book.xsd.

The creation of these objects in the repository is an atomic transaction such that, as far as the repository is concerned, the resources, versions, and dependencies all came into existence at the same moment in time. This is very important--if the import activity is not atomic then it cannot be easily rolled back and the repository will likely be in an incomplete, inconsistent state for some period of time. This is an important difference between CVS and Subversion, for example. CVS does not have any reliable form of atomic commit of multiple files while Subversion does. Any repository that cannot do atomic commits as a single transaction that can be rolled back is seriously limited and should be given a very close look. I don't know if it's been corrected in the meantime, but in 1999, when we were using Documentum to store documents for a bill drafting system, we discovered that Documentum could not do atomic commits as single transactions. This was very distressing to us.

Let's look at the data we have in our repository. For example, doc_01.xml might look like this:
<?xml version="1.1"?>
<book xmlns="http://www.example.com/namespaces/book"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="../dtds/book/book.xsd"
>
...
</book>

Anyone notice the problem?

The problem is the value of the xsi:schemaLocation= attribute: it's a relative URI that reflects the location of the schema on the filesystem from which the documents were imported. But we're not in that domain any more. We've crossed the pass through the mountains and we're into a different country with different language and customs. That URI may or may not be resolvable in terms of the location of the data within the repository.

If you're using a system like Subversion where the documents are never processed directly from the repository but are always exported first to create working copies and those working copies will reflect the original relative locations then that's OK, because the repository is really just a holding area.

But what you really want is the ability to process the documents in the repository directly from the repository (e.g., as though the repository were itself a file system of some sort). You want this because it's expensive and inefficient to have to do an export every time you want to process a document because, for most documents in the domain of technical documentation, there will be a lot of files involved, some of them potentially quite large (i.e., graphics). It would be much easier if you could just access the data directly, e.g., via an HTTP GET without the need to first make a copy of everything.

But in order to do that all the pointers have to be rewritten to reflect the new locations of everything in the repository as stored.

This is non-trivial but it's not that hard either. You just need to know what the repository-specific method of referring to objects within the repository is and what the mapping is from the objects as imported (that is, in their original locations) and the objects as stored. The exact forms of the repository-specific pointers could take many different forms: object IDs, HTTP URLs, repository-specific URIs, or whatever. In today's world it generally makes most sense for the repository to use URLs so that you can use standard and ubiquitous HTTP services to access your repository contents.

For example, the XIRUSS-T system defines a simple HTTP convention whereby you can refer to a version either by naming its resource by resource object ID and, optionally, naming a resolution policy (the default is "latest visible version") or by version object ID. The XIRUSS-T system also defines some basic organizational structures that can also be used to construct unambiguous and persistent URLs and you can define arbitrary organizational containers (analogous to directories) by which you can also address objects. So in XIRUSS there are two base addressing methods (resource ID + resolution policy and version ID) that will always work and can be constructed knowing only the resource ID or version ID and other "convenience" forms that will also work.

So for our example, let us assume that book.xsd results in resource object RES0002 and version object VER0002. We can rewrite the xsi:schemaLocation= value in doc_01.xml like so:
<?xml version="1.1"?>
<book xmlns="http://www.example.com/namespaces/book"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="/repository/resource/RES0002"
>
...
</book>

This is still a relative URL (relative to the server that holds the repository) but it is now addressing book.xsd as a resource/policy pair that can be reliably resolved to the appropriate version at any moment in time.

This need to rewrite pointers is universal if you want to be able to process storage objects as stored and you don't want to limit yourself to the static and limiting organizational facilities of a typical file system (which you don't, trust me).

Therefore, you need an import framework or mechanism that can do two things:

- For a given storage object to be imported, determine what its address will be within the repository after import. This could either be by asking the repository (e.g., resource = repository.createResource(); resource.getId()) or by applying some established convention or using metadata within the data to be imported (for example, you might have already assigned globally-unique identifiers to your documents, captured as attributes on the root element, and you use those identifiers as your within-repository object IDs).

- For each storage object, whatever its format, rewrite the pointers to reflect the new locations. It should go without saying that this process shouldn't break anything else. However, this is sometimes easier said than done. For example, the built-in XIRUSS-T XML importer imposes some limitations on what XML constructs it can and can't preserve during import, mostly for practical reasons.

This suggests that repositories should, as a matter of practice, provide some sort of import framework that makes it as easy as it can be (which isn't always that easy) to implement these operations. Any repository that provides only built-in importers or that does not make creating new importers particularly easy should get a very close look because it's likely that any built-in importer either won't do exactly what you want done or won't do everything you need done (even if what it does do, it does just how you want). If, for example, the import API is poorly documented or incomplete, or it doesn't provide any way to get, set, or predict a resource's ID in advance of committing it to the repository, you've got a problem.

This is an area that a lot of enterprises don't check when evaluating potential XML-aware content management systems but it is a crucial area to evaluate because it is where you will be investing most of your integration and customization effort. The last thing you want to have to do is call Innodata Isogen to help you figure out how to get your stuff into and out of the tool you've already bought. Not that we're not happy to help but we'd rather not see you be in that position at all. We'd rather you hired us to quickly implement the exact functionality you need, cleanly and efficiently, rather than bang our heads against some product that resists all our efforts to bend it to our will. We like to have fun in our jobs too.

So our initial import process wasn't quite complete. We need to insert step 3.1 to include the pointer rewrite:

3.1 In temporary storage (or in the process of streaming the input bytes into the newly-created version objects) rewrite all pointers to reflect the locations of the target resources or versions as they will be within the repository.

In XIRUSS-T's import framework I have generic XML handling code that supports this rewrite activity and essentially acts as a filter between the input (from the file system) and the output (the new version objects) to do the rewriting. This generic XML handling code can then be used by schema-specific code that understands specific linking conventions. For example, there is an XInclude importer that recognizes xi:include elements and knows that it is the href= attribute that holds the pointer to be rewritten, an XSD schema importer that knows about schemaLocation, import, and include, and an XSLT importer that understands XSLT's import and include elements. You get the idea.

Notice here the separation of concerns, separating the generic operation of essentially changing attribute values in XML documents from the concern of schema-specific semantics. It's just basic object-oriented layering and abstraction, but it's really important and, done correctly, it makes building importers so much easier.
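To make the idea concrete, here's a much-simplified Python sketch of a rewrite filter plus one "what is a pointer" table. A real importer (XIRUSS-T included) does a lot more--streaming, preserving markup details, handling multi-part attribute values--so treat this purely as an illustration; the pointer table and the repo_url_for() callback are my inventions:

import xml.etree.ElementTree as ET

XSI = "http://www.w3.org/2001/XMLSchema-instance"
XI = "http://www.w3.org/2001/XInclude"

# (element name or None for "any element", attribute name) pairs that hold pointers.
POINTER_ATTRS = [
    (None, "{%s}schemaLocation" % XSI),    # xsi:schemaLocation on any element
    ("{%s}include" % XI, "href"),          # xi:include/@href
]

def rewrite_pointers(xml_bytes, repo_url_for):
    """Rewrite pointer attribute values via repo_url_for(original ref) -> repository URL."""
    root = ET.fromstring(xml_bytes)
    for elem in root.iter():
        for elem_name, attr in POINTER_ATTRS:
            if elem_name is not None and elem.tag != elem_name:
                continue
            if attr in elem.attrib:
                elem.set(attr, repo_url_for(elem.get(attr)))
    return ET.tostring(root, encoding="unicode")

The schema-specific importer's whole job, in this caricature, is supplying the right POINTER_ATTRS table and the mapping behind repo_url_for(); the generic filter does the rest.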

OK, so now our repository data is completely consistent. I can access doc_01.xml directly from the repository (e.g., using an HTTP GET request to access the byte stream stored by the storage object VER0001 [the first version of resource RES0001, which is the resource representing all the versions in time of doc_01.xml]).

The structure of our repository looks like this:

/repository/resources/RES0001 - "doc_01.xml"; initial version: VER0001
/RES0002 - "book.xsd"; initial version: VER0002
/versions/VER0001 - "doc_01.xml"; dependency: DEP0001
/VER0002 - "book.xsd"
/dependencies/DEP0001 - Target: RES0002; policy: "latest"

We can now interrogate the repository and figure some things out. We can ask for the latest version of resource RES0001 and we'll get back version VER0001. We can ask for the list of all dependencies from VER0001 and we get back a list of one dependency, DEP0001. DEP0001 points to resource RES0002 with policy "latest" which resolves, at this point in time, to version VER0002.
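Resolving a dependency is mechanical once the policy is there. A Python sketch (again, the repository API is hypothetical, and only the two policies we've used so far are handled):

def resolve_dependency(repository, dependency):
    policy = dependency.policy                                  # e.g. "latest"
    if policy == "latest":
        return repository.latest_version(dependency.target_id)  # e.g. RES0002 -> VER0002
    if policy.startswith("Version "):
        return repository.version(policy[len("Version "):])     # pinned to a specific version
    raise ValueError("unknown resolution policy: " + policy)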

Assuming we have an HTTP server on the front of our repository that lets us access all these objects via HTTP URL, we can do something like this:

$/home/ekimber> validate http://repositoryhost/repository/versions/VER0001

The validation application, using normal HTTP processing, will open VER0001 as a stream (as it would for any HTTP resource), read its bytes, see the xsi:schemaLocation= value, resolve that URL normally, get those bytes, process them as an XSD schema (which they are) and validate the document. It's just that easy.

You can do this today with the XIRUSS-T system. For example, with XIRUSS you can have a document, its governing schema, and an XSLT all in the repository and all accessed and used directly from the repository via normal HTTP processing using most XSLT engines without modification. It just works.

However, we're not quite done yet. While the repository holds all our data and is internally consistent we haven't captured any metadata other than the original filenames of the files as imported (which is sort of a given but not necessary or even always desired--I've done it here mostly to make the repository structure clear).

So we need to add step 3.2 to extract and set the appropriate metadata. For this example, that will include:

- For book.xsd, the namespace it governs

- For each XML document, its XML version, all of the namespaces that it uses and its root element type.

- For each text file (that is, a file whose MIME type is some flavor of "text"), the character encoding used. All XML documents are also text files.

- For all storage objects, their MIME type.

- For each dependency, the dependency type (e.g. "governed by").

3.2 For each BOS member, identify the relevant metadata items and create each one as a metadata item on the appropriate newly-created repository object.
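As a concrete illustration, here is one way--a sketch using lxml, not XIRUSS-T's importer API--to pull those metadata items out of an XML storage object as it is imported; the function name is mine:

from lxml import etree

def xml_metadata(source):
    """Extract storage-object metadata for one XML document."""
    tree = etree.parse(source)
    root = tree.getroot()
    namespaces = set()
    for el in root.iter(tag=etree.Element):       # elements only; skip comments/PIs
        namespaces.update(uri for uri in el.nsmap.values() if uri)
    meta = {
        "mime type": "application/xml",
        "xml version": tree.docinfo.xml_version,
        "encoding": tree.docinfo.encoding,
        "root element type": etree.QName(root).localname,
        "namespaces": sorted(namespaces),
    }
    if root.get("targetNamespace"):               # present on XSD schema documents
        meta["target namespace"] = root.get("targetNamespace")
    return meta

print(xml_metadata("doc_01.xml"))
print(xml_metadata("book.xsd"))

The dependency-type metadata ("governed by") would be set by the schema-aware importer that created the dependency in the first place, since only it knows what the pointer means.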

The structure of our repository now looks like this:

/repository/
    /resources/
        RES0001 - name: "doc_01.xml"; initial version: VER0001
        RES0002 - name: "book.xsd"; initial version: VER0002
    /versions/
        VER0001 - name: "doc_01.xml"; Resource: RES0001
                  dependency: DEP0001
                  namespaces: http://www.example.com/namespaces/book
                  root element type: "book"
                  mime type: application/xml
                  xml version: 1.1
                  encoding: UTF-8
        VER0002 - name: "book.xsd"; Resource: RES0002
                  root element type: "schema"
                  namespaces: http://www.w3.org/2001/XMLSchema
                  target namespace: http://www.example.com/namespaces/book
                  mime type: application/xml
                  xml version: 1.0
                  encoding: UTF-16
    /dependencies/
        DEP0001 - Target: RES0002; policy: "latest"
                  Dependency type: "governed by"

Now we're getting somewhere. We can do a lot more interesting things with this information. For example, we can ask the question "what schema governs namespace 'http://www.example.com/namespaces/book'?". Or "what documents are governed by the schema for namespace 'http://www.example.com/namespaces/book'?" Or "what documents have the root element 'book'?" Or "give me all the XML documents". Or "give me all the XML documents that are not XSD schemas".

You get the idea.

Even if you have little implementation experience, it should be fairly obvious that these queries would be quite easy to implement--you just look at all the objects, examine their metadata values, and match them against the query terms. Of course for anything real you'd probably use a proper database to index and optimize access to the metadata, and there's no reason a normal SQL database wouldn't work perfectly well for that (even, perhaps, Gadfly). But the brute-force solution is pretty simple yet yields amazing power.
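To make "brute force" concrete, here is about all the code the simple version of such a query needs. The metadata dictionaries below just mirror the repository listing above; the query function itself is invented for illustration.

def query(objects, terms):
    """Return the ids of objects whose metadata matches every name/value in terms."""
    return [obj_id for obj_id, meta in objects
            if all(meta.get(name) == value for name, value in terms.items())]

versions_meta = [
    ("VER0001", {"mime type": "application/xml",
                 "root element type": "book",
                 "namespaces": ["http://www.example.com/namespaces/book"]}),
    ("VER0002", {"mime type": "application/xml",
                 "root element type": "schema",
                 "target namespace": "http://www.example.com/namespaces/book"}),
]

# "What schema governs the namespace http://www.example.com/namespaces/book?"
print(query(versions_meta, {"root element type": "schema",
                            "target namespace": "http://www.example.com/namespaces/book"}))
# -> ['VER0002']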

The import process I've outlined here is pretty much the minimum you need to do for XML to get consistent and correct data and useful metadata. The XIRUSS-T system provides a sample implementation of exactly this process with support for a variety of standard XML applications (XSD, XSLT, DITA (albeit old IBM DITA), and XInclude).

If you are thinking about these operations in terms of your own XML, which, if you're using XML for technical documentation today, is probably pretty sophisticated, you are probably realizing that there's a lot more that you either need to do or could do in terms of capturing important dependencies and storage-object metadata.

Note too that we haven't said anything about Layer 2 metadata, that is, metadata that applies directly to elements. The closest we've come is capturing the root tag name of each XML document, which is just a reflection of the fact that it's a common query that's easy to support at this level, so there's no reason not to capture it. [It helps particularly in supporting the common use case of document-level use-by-reference, where you point to an entire document in order to use its root element by reference. In that case, if you have captured the root element type and governing schema, you can implement reference constraints without having to look inside the document. That's a significant savings for a very common and reasonable re-use constraint. {Although link-based use-by-reference enables using elements that are not document roots, limiting yourself to document roots makes a lot of things simpler, especially in terms of authoring-support user interfaces--it's much easier to present a list of files or documents than to present a list of elements drawn from who knows where. So if you can live with the constraint, it's not a bad one to at least start with.}]
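As a sketch of that document-root constraint (the allowed root types here are invented for illustration), the check really can be this dumb, because it never has to open the target document:

ALLOWED_ROOTS = {"chapter", "section"}   # hypothetical re-use policy

def may_use_by_reference(version_metadata):
    """Decide re-use eligibility purely from storage-object metadata--no parsing needed."""
    return version_metadata.get("root element type") in ALLOWED_ROOTS

print(may_use_by_reference({"root element type": "book"}))     # False
print(may_use_by_reference({"root element type": "chapter"}))  # True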

Because with the repository as shown we can distinguish XML from non-XML storage objects, and because we can resolve the XML-level storage-object-to-storage-object relationships using normal XML processing tools, we can start by implementing all our Layer 2 and Layer 3 functionality in a completely ad-hoc way using brute force or, for example, by maintaining the necessary optimization indexes and databases totally separate from the repository. With just what I've shown here it should be obvious that there's enough information and functionality to, for example, write a simple process that gets each XML document and looks inside it in order to do full-text indexing or link indexing or whatever. This requires nothing more than the ability to send HTTP requests to the repository server (let us assume that you can use URLs to get specific metadata values or sets of metadata values).
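For example, a completely ad-hoc link indexer might look something like this sketch. The /versions listing URL and its JSON response shape are assumptions made for the sake of the example, not XIRUSS-T's actual interface:

import json
from urllib.request import urlopen
from lxml import etree

BASE = "http://repositoryhost/repository"
XI = "http://www.w3.org/2001/XInclude"

# Assume the repository can return version ids and their metadata as JSON.
versions_meta = json.load(urlopen(BASE + "/versions?format=json"))

link_index = {}                       # version id -> xi:include hrefs found inside it
for ver_id, meta in versions_meta.items():
    if meta.get("mime type") != "application/xml":
        continue                      # the storage-object metadata lets us skip non-XML
    doc = etree.parse(urlopen(BASE + "/versions/" + ver_id))
    hrefs = [el.get("href") for el in doc.iter("{%s}include" % XI)]
    if hrefs:
        link_index[ver_id] = hrefs

print(link_index)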

From this it should be clear that any system that starts by indexing all the XML just as a cost of entry (meaning cost of initial implementation as well as cost of use) is so obviously doing premature optimization that it's not even funny.

To sum up:

- The boundary between the outside world and the repository is the Rubicon that you have to cross to get your data into the repository.

- For documentation formats, XML or otherwise, doing import requires the following basic operations:

1. Determining the bounded set of storage objects that needs to be imported so that the result in the repository is complete and correct.

2. Rewriting any pointers in the imported data so that the imported result points to the targets as imported.

3. Extracting or gathering any storage-object metadata and binding that metadata to the newly-created repository objects.

4. Instantiating the repository objects, including resources, versions, and dependencies, with their attendant metadata and, for storage objects, their data content (as a sequence of bytes).

- For a given bounded object set to be imported, the import operation should be a single atomic transaction that can be rolled back (undone) as a single action. This ensures that the repository is always in a consistent state, even if the import processing fails midway.

- Some of the import processing can be generic (rewriting XML attribute values) but most of it will be schema-specific (understanding how XSD schemas are related to other documents, understanding the linking syntax and semantics of your private document type). In a layered system you can build up from general to specific, taking advantage of relevant standards, to make schema-specific import processing easier to create and maintain.

- The existence of standards like XSD, XInclude, and DITA makes it possible to build in quite a bit of very useful generic import functionality for those standards, which, even if that's all you have, gives you a pretty good starting point.

- You can still get a lot of mileage out of just Layer 1 metadata, as demonstrated by the scenario walked through in this post. We haven't done anything to capture Layer 2 metadata, yet we can already answer important questions about our documents as XML just through the simple metadata values we've captured.

- Note too that the repository itself, that is, the Layer 1 structures we've seen so far, knows absolutely nothing about XML. The metadata mechanism is completely generic name/value pairs where you, the importer, specify both the name and the value. This is why something like Subversion is an excellent candidate to build your Layer 1 system on.

That is, all the XML awareness is in the importer and in the queries applied against the repository, not in the repository itself. That's one reason I chafe at the term "XML repository"--it strongly suggests over-engineering and poor separation of concerns from the get-go.

- The correlation of files to be imported with existing resources and versions in the repository cannot be done automatically in the general case. You must define or provide one or more ways to do it, either automatically using some convention (as CVS does) or interactively through human intervention. A completely generic repository should leave the choice up to the implementor and integrator. XIRUSS-T does this through its generic system-object-to-version map.

There's lots more to discuss before we've even covered the basics of XML import but that will have to wait until next time.

Next time: importing the next version of doc_01.xml: all heck breaks loose
