Dr. Macro's XML Rants

NOTE TO TOOL OWNERS: In this blog I will occasionally make statements about products that you will take exception to. My intent is to always be factual and accurate. If I have made a statement that you consider to be incorrect or inaccurate, please bring it to my attention and, once I have verified my error, I will post the appropriate correction.

And before you get too exercised, please read the post, dated 9 Feb 2006, titled "All Tools Suck".

Friday, July 28, 2006

XCMTDMW: Import is Everything, Part 4

OK, back to our import use cases.

In Part 2 we left off after having imported the source XML for a publication (doc_01.xml) and its schema (book.xsd) and then having imported a second version of doc_01.xml without importing an unnecessary second version of the schema (because we were able to tell, through the intelligence about XSD schemas in our importer, that we already had the right schema instance in the repository).

We saw that the dependency relationships let us dynamically control which version or versions a link will resolve to at a particular point in time by changing the "resolution policy" of the dependency. This allowed us to import a new version of the schema without automatically making old versions invalid. It also gave us the choice of making some versions invalid or not as best reflected our local business policies.

Note that the storage object repository (Layer 1) doesn't care whether or not the XML documents are valid--it's just storing a sequence of bytes. It's the business processes and processors doing useful work that care or don't. This is why we can put the repository into a state where we know some of the documents in it are not schema valid.

Also, it should be clear that whether you allow import of invalid (or even non-well-formed) documents is entirely a matter of policy enforced by the importer. For example, you could say "if it's not schema valid it's not getting in" or you could simply capture the validity state in metadata, as we've done here. You could have a process that will import everything, no matter what, but if, for example, an imported XML document is not well-formed, it will import it as a simple text file with a MIME type of "text" not "application/xml". It's up to you. If your CMS doesn't give you this choice out of the box you've given up a lot of your right to choose your policies.
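A minimal sketch of such an importer policy, in Python. It applies the strict-or-lenient choice to well-formedness rather than schema validity, to keep it short. The function name, the `reject_malformed` flag, the property names, and the exact MIME string "text/plain" are all assumptions made for illustration, not part of any real importer.

```python
import xml.etree.ElementTree as ET


def classify_for_import(data: bytes, reject_malformed: bool = False) -> dict:
    """Apply a (hypothetical) import policy and return metadata for the new version."""
    try:
        ET.fromstring(data)
        well_formed = True
    except ET.ParseError:
        well_formed = False

    if not well_formed and reject_malformed:
        # Strict policy: refuse the document outright.
        raise ValueError("import refused: document is not well-formed XML")

    # Lenient policy: store the bytes anyway, but as plain text rather than XML.
    return {
        "mime type": "application/xml" if well_formed else "text/plain",
        "is well formed": "true" if well_formed else "false",
    }
```

The point of the sketch is only that the choice lives in the importer: the storage layer receives bytes plus metadata either way.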

In our repository we have the "is schema valid" property which will either be "true" or "false" (it could also be "don't know", for example, if you imported a document that referenced a schema you don't have or is in a namespace for which you have no registered schema).

Now imagine that we've built a Layer 3 Rendition Server that manages the general task of rendering publication source documents into some output, such as PDF or HTML. It's pretty likely that there's no point in rendering documents that are known to not be schema valid. With our "is valid" property the rendition server can quickly look to see if all the components of a publication are valid before it does any processing, which would be a big time saver.
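As a rough illustration of that pre-check, here is a Python sketch that treats each component as a plain dictionary of properties. The dictionary layout and the function names are assumptions. Only the "is schema valid" property and its three possible values come from the discussion above.

```python
def blocking_components(components: list[dict]) -> list[str]:
    """Names of components whose validity is not known to be "true"."""
    return [c["name"] for c in components if c.get("is schema valid") != "true"]


def ready_to_render(components: list[dict]) -> bool:
    """A publication is worth rendering only if nothing blocks it."""
    return not blocking_components(components)
```

Here "false" and "don't know" are both treated as blocking. Whether "don't know" should block rendering is a local policy decision.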

Likewise, we can easily implement a Layer 3 management support feature that notifies authors or managers of invalid documents so they know they need to modify them to make them valid again. This is especially important if, as in this case, we might unilaterally cause documents to become invalid through no fault on the part of authors.

Anyway, back to the use cases.

I've stipulated that doc_01.xml represents a publication, that is, a unit of publication or delivery. This notion of "publication" is my private jargon but the need for it should be clear in the context of technical documentation and publishing. Most publishing business processes are driven by the creation of publications or "titles" or "doc numbers".

But there's nothing particular about the XML data that identifies it, generically, as the root of a publication and in fact there's no requirement that a publication be represented by a single root document (although that's a simple and obvious thing to do).

So we probably need some business-process-specific metadata to distinguish publication roots from non-publication roots. So let's define the metadata property "is publication root" with values "true" or "false". For doc_01.xml we set the value to "true" since it is a publication root.

Now we can do some interesting stuff--we can query the repository to find only those documents that are publication roots, which would be pretty useful. For example, it would allow us to narrow a full-text search to just a specific publication or just produce a list of all the publications. If we also have process-specific metadata for publications, such as its stage in the overall publication development and delivery workflow, we can see where different publications are. If we capture the date it was last published we can know if it's up to date relative to some other related publications. You get the idea.
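A sketch of what such queries could look like, in Python, over versions represented as property dictionaries. The "workflow stage" property name and both function names are hypothetical. A real repository would expose this through its own query API.

```python
def publication_roots(versions: list[dict]) -> list[dict]:
    """Select only the versions flagged as publication roots."""
    return [v for v in versions if v.get("is publication root") == "true"]


def roots_in_stage(versions: list[dict], stage: str) -> list[str]:
    """Names of publication roots currently in the given workflow stage."""
    return [
        v["name"]
        for v in publication_roots(versions)
        if v.get("workflow stage") == stage
    ]
```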

So import a new document already!

Fine. Let's import doc_02.xml, which is the root of another publication. It conforms to the latest version of book.xsd. We have authored this document with knowledge of doc_01.xml and have created a navigation link from doc_02.xml to doc_01.xml, like so:

<bibcite href="http://mycms/repository/resources/RES0001">Document One</bibcite>
for more information</p>

It also has a link to a Web resource:

<bibcite href=""/></bibcite>
for more information</p>

Note that the link to doc_01.xml uses a URL that points into our repository. But this is an absolute URL, which it needs to be as long as the document is outside the repository. It is a pointer to the resource for doc_01.xml which will be resolved by the repository into the latest version of that resource, which at the moment will be version VER0003, the second version of doc_01.xml.

This is the easiest case for import because it's unambiguous what the link points to in terms of data that is already in the repository. The address will still have to be rewritten on import but there's no question what it needs to be rewritten to and no question that the target is a version that is already in the repository.

By contrast, if we had instead authored the link as a reference to a local copy of doc_01.xml, the link might look something like this:

<bibcite href="../doc_01/doc_01.xml">Document One</bibcite>
for more information</p>

In that case, the importer has to figure out, by whatever method, whether the file at "../doc_01/doc_01.xml" is in fact a version of a resource already in the repository and whether or not this local version of doc_01.xml is itself a new version that needs to be imported. Again, figuring this out cannot be automatic in the general case but depends on implementation-specific mechanisms.

This raises the question of how the link was authored in the first place. It's unlikely the author looked inside the repository to figure out that doc_01.xml is really RES0001 and then typed the correct URL. So the authoring tool must be integrated with the repository such that the author can request a list of potential reference targets, pick one, and have the most appropriate address put into the href= attribute value. So there must be some sort of integration API that the repository exposes that the authoring tool can use.

Note too that the details of the link authoring are completely schema-specific as are the policy rules for what can be linked to. In this case, let us assume that the "bibcite" (bibliographic citation) element can only link to entire publications. That's the easiest case because it only requires us to point to entire storage objects, which we can do with just our storage object repository and which is the easiest UI to create (a flat list of publications). Note also that by adding the "is publication root" metadata property we've enabled the creation of just this selection aid since now we can query the repository to get a list of publication root documents. For a complete implementation we'd probably want to capture things like document title and document number (if it's known) as storage object metadata just for convenience (we could always look inside the documents at the time we build the UI but that would be time consuming; it's easier to capture it on import or get it from somewhere else once).

If rather than pointing into the repository we pointed to a local working copy, then the UI could be as simple as just putting up a file chooser and letting the author figure out which file is a publication root (which they would probably know if you've imposed a consistent file organization scheme, such as putting each publication in its own directory under a common root directory). Or you could export sufficient metadata to enable the same UI as before or you could ask the repository, as above, but then, because the repository remembered where stuff was checked out, it knows what local URL to use.

Which approach is best depends a lot on how the data will be used or handled outside the repository. If your authors are always connected to the repository, there's no reason not to point to it directly. If your authors need to be able to work offline then you'll have to go the local working copy route. And of course you can support both modes of operation from a single repository because it's entirely a function of the importer and exporter logic.

In any case, on import the resulting URL for pointing to doc_01.xml is "/repository/resources/RES0001", that is, a relative URL (because everything's in the same storage space at this point). We also determine that we don't need to create a new version of doc_01.xml so we don't.
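For the easy case, where the authored link already points into the repository, the rewrite can be sketched in a few lines of Python. The host name "mycms" and the "/repository/" path prefix are taken from the example URLs above. The function name is an assumption, and anything that is not an absolute repository URL is simply returned untouched, since resolving local working-copy paths needs implementation-specific logic.

```python
from urllib.parse import urlsplit

REPOSITORY_HOST = "mycms"  # host used in the example URLs


def rewrite_href_on_import(href: str) -> str:
    """Turn an absolute repository URL into a repository-relative one."""
    parts = urlsplit(href)
    if parts.netloc == REPOSITORY_HOST and parts.path.startswith("/repository/"):
        return parts.path
    return href  # not a repository address: leave it for other rules to handle
```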

For the second bibcite, the one to the DocBook site, what happens? The importer could blindly look to see if it has anything for the DocBook site, see that it doesn't, and start importing all the HTML from the site as new resources and versions (XIRUSS-T includes a default HTML importer that will do this). But that's probably not what you want to do.

So the importer has to have some rules about what things are and are not ever going to be in the repository and external Web sites are probably never going to be in the repository. So on import the importer does not rewrite the href= to the DocBook site.

However, it could do something very interesting. It could create a new resource and version in the repository that acts as a proxy for the DocBook Web site. This would be useful because then we can create a dependency relationship between version doc_02.xml and the DocBook Web site without having to literally copy the Web site into our repository. This lets us manage knowledge of an important dependency using the same facilities we use for all our dependencies and gives us a way to capture and track important properties of the Web site using our local metadata facilities. The version is a version that is just a collection of properties with no data (although we could capture data if we wanted to, for example, to capture a cache of the state of the target Web site at the time we did the import).
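One way to picture such a proxy is as a version object whose data is empty and whose properties carry everything we know about the external target. The Python sketch below assumes a simple `Version` dataclass and the property names "is proxy" and "original url"; none of these are defined in this series, they are placeholders for whatever your repository uses.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Version:
    version_id: str
    resource_id: str
    properties: dict = field(default_factory=dict)
    data: Optional[bytes] = None  # None means "properties only, no stored bytes"


def make_web_proxy(resource_id: str, version_id: str, url: str) -> Version:
    """Create a data-less version that stands in for an external Web resource."""
    return Version(
        version_id=version_id,
        resource_id=resource_id,
        properties={"is proxy": "true", "original url": url},
    )
```

Caching the state of the target at import time would just mean filling in `data` as well.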

So how has the repository changed following the import of doc_02.xml?

- Created new resource RES0005 and new version VER0005 for doc_02.xml.

- Set the "is publication root" and "is schema valid" properties to "true" for VER0005.

- Created new resource RES0006 and new version VER0006 for the DocBook Web site (the proxy described above).

- Created three new dependencies from VER0005 to the following resources:

  - "governed by" dependency to RES0002 (the book.xsd schema)

  - "document citation" to RES0001, reflecting the first bibcite link

  - "document citation" to RES0006, reflecting the second bibcite link
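The three dependencies listed above can be written down as data, which makes it easy to see how later processes would query them. This Python sketch uses a hypothetical `Dependency` tuple and a hypothetical default policy name "latest"; the IDs and dependency types are the ones from the list.

```python
from typing import NamedTuple


class Dependency(NamedTuple):
    source_version: str
    dep_type: str
    target_resource: str
    policy: str = "latest"  # hypothetical name for "resolve to the latest version"


DEPENDENCIES = [
    Dependency("VER0005", "governed by", "RES0002"),
    Dependency("VER0005", "document citation", "RES0001"),
    Dependency("VER0005", "document citation", "RES0006"),
]


def targets(deps: list[Dependency], source_version: str, dep_type: str) -> list[str]:
    """Resources that a given version depends on through a given dependency type."""
    return [
        d.target_resource
        for d in deps
        if d.source_version == source_version and d.dep_type == dep_type
    ]
```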

Now let's do something useful with this data we've worked so hard to create: print it.

Let's say we have an XSLT script that converts book.xsd documents into XSL-FO for rendering into PDF. To apply this script to a publication we need only point our XSLT engine at the publication root version and style sheet and away it goes, e.g.:
c:> transform http://mycms/repository/versions/VER0005 book-to-fo.xsl >

Because our repository acts as an HTTP server we don't have to do an export first.

But there's an important question yet to answer: what does our XSLT script do with the various links?

In doc_02.xml we have two links to separate publications and those links need to be published in a form that will be useful in the published result. What does that mean?

In this case, we're publishing to PDF so we can presume that we want the links to be navigable links in the resulting PDF. Easy enough. But what will the links be to?


In the case of the link to the DocBook Web site that's pretty easy: just copy the URL out as it was originally authored (or as constructed through the use of our proxy version which will have had to remember the original URL or the normalized URL for the target Web site). No problem. Unless we use the proxy object, in which case either the XSLT has to know how to translate a reference to the proxy into a working URL, e.g., get the proxy object, get the appropriate metadata values, and go from there, or the repository has to provide a "getUrlForWebSite()" method that takes a Web site proxy object as input and returns the best URL to use for getting to the Web site itself. This type of function could be characterized as a "top of Layer 1" or "bottom of Layer 2" bit of functionality, in that it's generic but it's reflecting our locally-specialized version types. In this case it's generic enough that it should probably be built into Layer 1, although since it deals with issues of link resolution and data processing it's arguably Layer 2 functionality.
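A sketch of what a "getUrlForWebSite()" helper might do, written here as a Python function over the proxy's metadata. The property names "normalized url" and "original url" are assumptions, as is the preference order between them.

```python
def get_url_for_web_site(proxy_properties: dict) -> str:
    """Return the best URL recorded on a Web site proxy's metadata."""
    # Prefer a normalized URL if the importer recorded one, else fall back
    # to the URL exactly as it was authored.
    for key in ("normalized url", "original url"):
        url = proxy_properties.get(key)
        if url:
            return url
    raise LookupError("proxy has no recorded URL")
```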

In any case, the Web site link is relatively easy.

But the link to publication doc_01.xml is a bit trickier: we almost certainly don't want the PDF to link to the original source XML, either as it resides in the repository or in some checked-out location. We want it to link to doc_01.xml as published. But what is that?

This is the tricky bit: if we haven't already published doc_01.xml then we either have to first publish it and then point to that result or we have to be able to predict in advance where it will be when published or we have to be prepared to post-process the published result (the PDF in this case) to rewrite the pointer to doc_01.xml as published at such time as we know where it is. And even then, if we move the PDFs around we may still need to rewrite the pointers.

This suggests that we need to be able to do pointer rewriting. For anything. But we already have a generic facility for that in our import/export framework. Happy day! All we have to do is implement code that knows how to do it for PDF and Bob's your uncle. [Have I had too much coffee this morning?]

This also suggests that the best place to publish to is the repository itself, because we can both easily serve the results from there and we can easily export them as needed, doing any pointer rewriting that might be necessary. We can also establish dependency relationships between the published results and their source data and capture any other useful metadata about the published artifacts. Because the core repository is generic there's no problem using it to store PDFs or anything else.

So we're starting to build up a set of components and repository features that together form a "Rendition Manager" that handles the generic aspects of publishing. This rendition manager needs to do the following:

- Get the input parameters for applying a given rendition process to a given version or set of versions.

- Provide the appropriate utility functions to rendition processors needed to get access to object metadata, resolve pointers, and so forth.

- Manage the import of newly-created rendition results back into the repository reflecting its knowledge of the inputs to the process. That is, while we can certainly have a generic PDF importer, we need a PDF importer that also knows that PDF doc_01.pdf was generated from version VER0003 of doc_01.xml and sets a dependency relationship reflecting that.
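The last responsibility in the list above, importing a rendition along with knowledge of its source, might be sketched like this in Python. The repository is modeled as a plain dictionary with "versions" and "dependencies" entries, which is an assumption made purely to keep the example self-contained. The "rendered from" dependency type matches the one used in this walkthrough.

```python
def import_rendition(repo: dict, version_id: str, resource_id: str, data: bytes,
                     source_resource: str, source_version: str) -> None:
    """Store a rendered PDF and record which source version it was rendered from."""
    repo["versions"][version_id] = {
        "resource": resource_id,
        "data": data,
        "properties": {"mime type": "application/pdf"},
    }
    # The dependency targets the source *resource*; the policy pins the exact
    # version that was actually fed to the rendition process.
    repo["dependencies"].append({
        "source": version_id,
        "type": "rendered from",
        "target": source_resource,
        "policy": f"Version {source_version}",
    })
```

A generic PDF importer would do only the first half. The second half is what the rendition manager adds.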

Some of this rendition manager can be built into the Layer 1 code, as discussed above (i.e., the API or protocol functions needed) but the management of the specific processors will be a Layer 3 component. That is, conceptually, the Rendition Manager is a client of Layers 1 and 2, in just the way an integrated authoring tool would be.

But you must have some form of Rendition Manager in order to do manageable publishing from the repository unless you do everything via bulk export and ad-hoc processes.

This is an important question to ask of any full-featured CMS provider: do you provide features and components that either comprise a rendition manager or make creating one easy?

Ok, so we run our rendition process and create a new PDF, doc_02.pdf, and bring it into the repository. The link to doc_01.pdf uses the URL "/repository/resources/RES0007". The link to the DocBook Web site uses the URL "/repository/resources/RES0008". In the repository we create the following new objects:

- Resource RES0006 and version VER0007 for doc_02.pdf. Its MIME-type property indicates that it is of type "application/pdf".

- Resource RES0007 (and no version) for doc_01.pdf. Surprised? This reflects the fact that we know that at some point in the future there will need to be a doc_01.pdf but we haven't created it yet. The resource object lets us link to it even though we haven't created any versions.

- Resource RES0008 and version VER0008 for the Web site. The metadata would include the absolute URL of the Web site and anything else we can usefully glean from it.

- A dependency of type "rendered from" from VER0007 to resource RES0005 with policy "Version VER0005" indicating the exact version the PDF was created from.

- A dependency of type "navigates to" from VER0007 to resource RES0008, indicating the link to the Web site.

- A dependency of type "navigates to" from VER0007 to resource RES0007, indicating the link to doc_01.pdf.

Why did we create the dependencies from the PDF document? We'll need these should we ever need to export a set of inter-linked PDFs to some delivery location, i.e., the external corporate Web site, an online review server, our local file system, whatever. We also need to know whether or not all the link dependencies are satisfied. We also may need to know if the workflow states of the source publications are those required in order to complete a publishing operation, which we can get by navigating from a given PDF to its publication source to see if it is, for example, in the "approved for publication" state, or if it exists at all.

For example, let's say that doc_01.xml version VER0003 is in fact in the "approved for publication" state, as is the latest version of doc_02.xml. If we try to do the "publish to corporate Web site" action (a Layer 3 process), we'll first chase down all the "navigates to" dependencies so we can get the PDFs of targets that are PDFs. We navigate to resource RES0007 and discover that it has no versions. With no versions we can't go on--we have no way of knowing, with the repository data we have, what publication might correspond to this PDF resource. Hmmm.

One way to address this would be to create a "rendition of" dependency from versions to the renditions generated from them. But those dependencies would be redundant with the equivalent links from the renditions to their source versions. Thinking about it, it makes more sense to create a resource-to-resource "rendition of" relationship.

This can be done with metadata on the resource object where the value is just a list of resources that are renditions of this resource. There's no need for indirection because we don't need to select a version, we just need to know that resource RES0007 (doc_01.pdf) is a rendition of resource RES0001. We need to know this because when we finally get around to rendering doc_01.xml we need to know what resource the PDF we create is a version of. The PDF-to-source dependency links will establish the version-to-version relationships.

Ok, so we do that such that resource RES0007 has the metadata property "rendition of" with the value "RES0001" (doc_01.xml).
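With that property in place, finding the rendition resources for a given source resource is a simple metadata lookup. A Python sketch, assuming resources are held as a dictionary of property dictionaries keyed by resource ID (the layout and the function name are illustrative only):

```python
def renditions_of(resources: dict, source_resource_id: str) -> list[str]:
    """IDs of resources whose "rendition of" property names the given resource."""
    return sorted(
        rid
        for rid, props in resources.items()
        if props.get("rendition of") == source_resource_id
    )
```

Note that this works even when the rendition resource has no versions yet, which is exactly the situation we're in with doc_01.pdf.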

Now when we go to do our publication, we resolve the navigates-to dependency from doc_02.pdf to resource RES0008, the Web site. We discover that this is a resource that is really outside the repository (by looking at the resource or version metadata, which I haven't shown). We see that it's a link to a Web site so we try to resolve the URL to make sure the Web site is at least still there. We can't really know if the Web site is still relevant without putting a human in the loop but we can at least catch the case where the Web site or specific Web resource is completely gone or unreachable.

Next, we resolve the navigates-to dependency from doc_02.pdf (VER0007) to resource RES0007 and see that it is a rendition of resource RES0001. We get the latest version, VER0003, and check its "approved for publication" status. It's "true" so we can continue. If it had been "false" we'd have to stop right there and report back that not all the dependencies are ready to be published externally.

But we still have the problem that there's no PDF for doc_01.xml. What do we do?

We could halt the process and report that somebody needs to render doc_01.xml or we could just do the rendering job ourselves as we know that all the prerequisites have been met. Let's do that. This creates new PDF doc_01.pdf, which we import into the repository just like we did doc_02.pdf, with all the same dependencies and properties and whatnot.
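The checks described over the last few paragraphs can be sketched as one small Python function. Everything about the data layout here is an assumption: the repository is a dictionary of "resources", "versions", and "dependencies"; each resource lists its version IDs oldest first; and `render` is a caller-supplied function that renders a source version and returns the ID of the newly imported PDF version. External proxies are skipped because their reachability check is a separate step.

```python
def prepare_for_publication(repo: dict, pdf_version: str, render) -> bool:
    """Check every "navigates to" target of a PDF; render missing PDFs on demand."""
    for dep in repo["dependencies"]:
        if dep["source"] != pdf_version or dep["type"] != "navigates to":
            continue
        target = repo["resources"][dep["target"]]
        if "rendition of" not in target:
            continue  # e.g. a Web site proxy: checked separately by fetching its URL
        source = repo["resources"][target["rendition of"]]
        latest = source["versions"][-1]
        if repo["versions"][latest].get("approved for publication") != "true":
            return False  # a prerequisite is not ready: stop and report
        if not target["versions"]:
            # No PDF exists yet, but the source is approved, so render it now.
            target["versions"].append(render(latest))
    return True
```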

Now our requirement that all the local dependencies are satisfied is met. Everything's in the correct workflow state, so we now export the PDFs out in a form that can be placed on the corporate Web site. To do this we have to rewrite the URLs of the navigation links from pointers to PDFs inside the repository to pointers to the PDFs at the locations they will have on the corporate Web site.

This means that the exporter component has to know what the business rules are for putting things on the corporate Web site, either directly because the rules are coded into the software, or indirectly because, for example, the PDF version objects have metadata values that say what the location should be or must be.

Let's keep it simple and say that the PDFs are located relative to each other and in the same directory. This means we can rewrite the within-repository URL from "/repository/resources/RES0007" to "./doc_01.pdf".
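Under that simple same-directory rule, the export-time rewrite could look like the following Python sketch. The `export_names` mapping from resource ID to output file name stands in for whatever business rules or metadata the exporter really consults; it and the function name are assumptions. The example uses the link written into doc_02.pdf ("/repository/resources/RES0007").

```python
RESOURCE_PREFIX = "/repository/resources/"


def rewrite_href_on_export(href: str, export_names: dict) -> str:
    """Map a within-repository resource URL to its sibling file in the export directory."""
    if href.startswith(RESOURCE_PREFIX):
        name = export_names.get(href[len(RESOURCE_PREFIX):])
        if name:
            return "./" + name
    return href  # unknown or external target: leave the address alone
```

With `export_names = {"RES0007": "doc_01.pdf"}`, the link to RES0007 becomes "./doc_01.pdf", and any address the mapping doesn't cover passes through unchanged.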


We've finally produced some usable output from our system. Time to go home and celebrate a job well done.

Let's review what we've done and seen:

- We've taken a system of two inter-linked publications through a cycle of authoring and revision and publication.

- We've created document-to-document hyperlinks using services provided by the Layer 1 storage manager coupled with Layer 3 customizations integrated into our authoring tool (have I made it clear that authoring tools are Layer 3 components? That should be obvious by now as, except for simple text editors, they're all about the semantics of your documents.).

- We enabled sophisticated workflow management reflecting local business rules and processes just by adding a few more metadata properties to our version, resource, and dependency objects.

- We created a Rendition Manager that can manage the creation of renditions from our documents such that the rendered results are themselves managed in the repository, which is a requirement in order to support processes such as publication to a corporate Web site or any operation that requires address rewriting on export.

- We created a Layer 3 component that manages the "publish to corporate Web site" action by using storage object metadata and dependencies to establish that all the necessary prerequisites are in place (workflow state, existence of renditions) or, if necessary, use the Rendition Manager to produce needed components (doc_01.pdf).

- We introduced resource-to-resource links using simple metadata values on resources to establish relationships between resources to support the case where a resource may be created in advance of having any versions.

- We made it clear that our repository can not only manage any kind of storage object but that it's essential that it do so in many cases. Thus we put our PDF renditions back into the repository from which they can be accessed directly for viewing or exported for delivery from other places.

- We saw the utility in creating "proxy" versions for things we don't own or control so that we can manage our dependencies and metadata on those resources within the repository, keeping all our processing closed over resources, versions, and dependency objects. Very important. You can do all sorts of really useful and clever things with these proxies, including mirroring resources managed in other physical repositories as though they were in yours. [Pinky to corner of mouth, low evil chuckle. Mischievous faraway glint in eyes. Absently pat head of Mini Me.]

This is pretty sophisticated stuff and is more than a lot of commercial systems do today (while at the same time they do stuff you don't want or need). And we've done it all with relatively simple software components that are connected together in clever ways. Because all the Layer 3 stuff we've invented for this use case can be built in isolation both from the Layer 1 repository and from each other, they can be individually as simple or sophisticated as needed or as you can afford. For example, the Rendition Manager could really just be a bunch of XSLT scripts or it could be a deeply-engineered body of Java code served through a full-scale Web server and designed to handle thousands of rendition requests an hour. But the minimum functionality of each of these components is pretty modest and no single component represents an unreasonable implementation difficulty--it's all very workaday programming: get an object, get a property value, chase it down, check a rule, get the target object, check its properties, apply a business rule, run a process, move some data, create a new resource, set some properties, blah blah blah is it lunch yet?

That is, the requirements might be broad but the implementation need not be deep and it certainly doesn't need to be monolithic or exclusive.

I know that at this point you've been given a lot to think about and if you've read this far in anything like one go your head is probably spinning. Mine is and I've written this stuff.

Hopefully I've succeeded in binding these general concepts and architectures to realistic use cases and processes that make it easier to see how they apply and where their power as enabling abstractions and as implementation and design techniques really lies.

Note too that we've only done document-to-document links--we haven't said anything about element-to-element links and what that might imply. That's actually because most of the inherent complexity is at the storage-object-to-storage-object level. Going to element-to-element linking really only complicates user interfaces and presents some potential scale and performance problems (because of the sheer potential volume of data to be captured and managed because there're typically orders of magnitude more elements than storage objects). But the fundamental issues of resolution and dependency tracking are the same so we'll see that we really don't have to do much more to our system to enable creation, use, and management of element-to-element links. We've done almost all the hard work already. And it wasn't really that hard.

[As an aside: I'm pretty happy with how this narrative is coming out even in this first draft. I fully intend to edit it together into a more accessible, coherent form as soon as I get it all out of my head, which shouldn't take too much longer. I hope.]

Next time: element-to-element linking (probably)



Anonymous said...

"So the authoring tool must be integrated with the repository such that the author can request a list of potential reference targets, pick one, and have the most appropriate address put into the href= attribute value"

THANK YOU!! This is my gripe about implementation of CMS as version control systems. Without integrated authoring pieces, you require (typically unskilled) laborers to make the policy for you of how to identify when a new asset is related to an existing one. What then ensues is at least one user who decides to sandbox the whole system onto a server or shared drive thereby creating more versioning nightmare and sometimes, ultimately, dropping of a system because of mis-utilization.

This to me is an excellent point of enterprise content management wherein the tools and the CMS integrate tightly. I am not sure I've seen one do that yet (although I've only been exposed to TEAMS and, briefly, Documentum). Unless your workers are skilled and understand the concept of versioning and lifecycle management, you NEED that overhead of integrated authoring and CMS.

4:17 PM  
Eliot Kimber said...

Even if you have authors that are skilled and understand the concepts of versioning and lifecycle management you still need the overhead of integrated authoring and CMS. The fact is that creating version-aware links by hand is just not practical or reasonable to do unaided. If you're going to invest big money in a management system it doesn't make any sense to not also invest in direct task support for authors.

That's in fact why I tend to value authoring tool features over CMS features--you can get by with minimal CMS features (as I've stated, I've proved it using CVS) but you can't get by without authoring features needed to make creation of the information simply possible.

9:41 PM  
