Dr. Macro's XML Rants

NOTE TO TOOL OWNERS: In this blog I will occasionally make statements about products that you will take exception to. My intent is always to be factual and accurate. If I have made a statement that you consider to be incorrect or inaccurate, please bring it to my attention and, once I have verified my error, I will post the appropriate correction.

And before you get too exercised, please read the post, dated 9 Feb 2006, titled "All Tools Suck".

Saturday, September 16, 2006

XQuery: Not So Bad After All

I've recently, finally, had a need to use XQuery for something--up to this point everything I've done with XML since XQuery was solidified was with XSLT and DOM programming. I've been observing XQuery since the effort started back in the dim mists of time and, like XML Schemas, had little hope that it would ever see the light of day, for the simple reason that there seemed to be too many cooks and too many different requirements for the committee to ever reach a useful consensus on things like syntax and semantics. In particular, there seemed to be a fundamental chasm between the database people, who wanted an SQL for XML, and the document people, who wanted XSLT with result sets. But that was from a distant vantage point with little direct visibility into the activity other than getting all the function-related committee email because I'm a member of the XSL working group (not that I actually read any of that email unless the subject line was particularly intriguing and I had the time to devote to reading it).

But of course my pessimism was unfounded and XQuery has emerged as a solid and useful specification with a number of implementations.

I've now been constructing simple XQueries for a couple of weeks and I must say it's pretty cool to do with a simple query what would take a good bit more work to do with an XSLT script (given a running XQuery-supporting repository, of course).
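For example, here's the sort of thing I mean (a minimal sketch against a hypothetical collection--the collection URI and element names are mine, not from any real repository): pull every warning that mentions scissors out of a collection of warning documents:

for $w in collection("/common/warnings")//warning
where contains($w/p, "scissors")
return $w

The equivalent XSLT would need at least a driver template and a match pattern or two; the FLWOR expression just says what you want.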

I also found Mike Kay's Learn XQuery in 10 Minutes to be a very helpful startup guide, providing just the how-to information I needed to get the basic syntax and techniques. The rest of XQuery (at least the part I've used) is pretty obvious and intuitive to anyone familiar with XSLT.

Of course, I haven't had the opportunity or time to determine to what degree various tools provide complete and correct implementations of XQuery, but I'm sure I will. By the same token, the standard has a solid set of test cases that make it pretty hard not to know whether you're implementing it both correctly and completely.

My main concern would be collation, which was very broken in XSLT (in the sense that the mechanism for specifying custom collators for sorting was not well standardized and was only usefully implemented by Saxon for the purposes of doing XSLT processing of localized documents [i.e., back-of-the-book index collation]). I know that XSLT 2 (and therefore XQuery, which shares the same collation semantics) has attempted to be more general, but when I first looked at what was in Saxon 8 (a couple of years ago now) it wasn't quite what I needed (you had to declare a separate collation URI for each locale, while I wanted a single collation URI that named a collator that then did the right thing at the right time based on an outside configuration mechanism). [With Saxon 6 you had to implement per-locale classes that Saxon used based on an invariant mapping of locale names to collator class names. At least the XSLT 2 mechanism is more general.]
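To make the per-locale-URI complaint concrete, in both XSLT 2 and XQuery you name a collation URI at the point of use, something like this (a sketch from memory--treat the Saxon parameter syntax as illustrative, not gospel):

for $term in doc("index.xml")//term
order by $term collation "http://saxon.sf.net/collation?lang=de;strength=primary"
return $term

What I wanted instead was one stable collation URI whose behavior is driven by an outside configuration mechanism, rather than a distinct URI per locale baked into every stylesheet or query.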

But I haven't had time or business need to push on the XSLT 2/XQuery collation mechanism for a while so I really don't know. But I suppose I really should, because as far as I can tell I'm about the only person who really worries about this particular issue (I developed a generic index configuration and collation support library for use with Saxon, which is available here: Internationalization Support Library [note: log-in may be required. If this is a problem, send me an email and I'll forward you a copy.]. Note that this code is equally applicable to XSL-FO 1.0 and the new indexing support in XSL-FO 1.1 as the FO indexing is only about constructing sequences of page numbers and not about sorting the index entries themselves. In addition, this code is equally useful for things like generated glossaries.)

One interesting question I've already run into: when engineering a complete Web site to serve XML data and queries against it, how much should be done in XQuery alone and how much should be done with more traditional Web site technologies such as JSP or Ruby? You can of course use XQueries to generate HTML pages that reflect the query results and can therefore use XQuery exclusively to build a Web site (given some sort of CGI-like facility, such as the extensions that MarkLogic provides or just everyday CGI scripts), but should you?
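For what it's worth, the do-it-all-in-XQuery approach looks something like this (a sketch with hypothetical element names--direct element constructors make the query itself the page template):

<html>
  <body>
    <h1>Warnings</h1>
    <ul>{
      for $w in collection("/common/warnings")//warning
      return <li>{string($w/p)}</li>
    }</ul>
  </body>
</html>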

My initial instinct is that you should not, that good engineering practice argues for clear separation of concerns and that XQuery should focus on queries and something else should focus on the user interface. But I'd be curious if anyone has a strong counter argument. One of the things that's attractive about the do-it-all-in-XQuery approach is that you can build stuff really quickly because there's little overhead, so it is good for proofs of concept and demos. But I can't see it being a sustainable approach for production Web sites (although I'm sure more than one person will answer that they've been doing it for years now).


Wednesday, September 13, 2006

XCMTDMW: Why Indirection is So Important for Authoring

[This is a continuation of the discussion of linking and addressing that left off here: XCMTDMW: Element to Element Linking: Overview]

In my last installment we saw that when we have a direct link to an element and a new version of that element is created in a way that changes the element's location, the existing pointers to it no longer resolve to the correct element, and we have to create new versions of the documents that contain those pointers.

That is, we started with doc_01.xml, which has an XInclude link to a warning:
<?xml version="1.0"?>
<doc xmlns:xi="http://www.w3.org/2001/XInclude">
...
<xi:include href="../common/warnings/dont_run_scissors.xml"/>
...
</doc>
Which points to dont_run_scissors.xml:
<?xml version="1.0"?>
<warning>
<p>Don't run with scissors.</p>
</warning>
Thus at time T[1] we have two resources (in the SnapCM sense), one for doc_01.xml and one for dont_run_scissors.xml, and one version of each resource.

The XInclude link relates itself to the warning element that is the root element of dont_run_scissors.xml.

At time T[2] we create a new resource warnings.xml and create the first version of that resource by copying the warning from dont_run_scissors into it, along with other warnings we have:
<?xml version="1.0"?>
<warning_set>
<warning>
<p>Don't run with scissors.</p>
</warning>
<warning>
<p>Don't stand on the top rung of a step ladder.</p>
</warning>
</warning_set>
At this point, time T[2] (each T time reflects a snapshot in time of the repository state, reflecting the commitment of an invariant version of one or more resources into the repository), if we resolve the XInclude link in doc_01.xml, what will we get? We'll get the warning element in dont_run_scissors.xml, for the simple reason that that's where the XInclude points in the first (and so far only) version of resource doc_01.xml.

But we know that there's a new version of the warning (how we know is a question that we'll come back to later--for now assume that we talked to Jane, the Warning Mistress, and she happened to mention that she had decided to reorganize all the warnings into a single document).

As the author of doc_01.xml, that leaves us with a choice: do we leave things as they are, knowing that we'll forever get that original version of the warning, or do we react to the change in location of the warning? Of course we must react, because we want to make sure we get the latest version of the warning content.

This means that we must create a new version of doc_01.xml that differs only in the form of address used to include the warning. Note that the warning text has not changed nor has the meaning of the warning or how we are using it in doc_01.xml[v1] (meaning version 1 of resource doc_01.xml).

This is a problem: nothing about the information content of the elements involved has changed--the content is the same, the semantics are the same, and doc_01.xml had no other need to change. Yet a simple relocation of the warning element forces us to create a new version of doc_01.xml. And not just doc_01.xml, but every document that uses that warning, which could be a very large number of documents indeed, if it's a common warning.

This is clearly not good. It is especially not good if our addresses are to specific versions of resources and not to resources (which are then resolved to specific versions using some policy such as "latest visible version"). This is because the simple act of creating a new version would require that all pointers to previous versions at least be re-evaluated, and quite likely new versions of the documents containing those pointers would have to be created, which would in turn require analysis of the pointers to those documents, and so on. In the worst case, creation of a single new version of a resource requires creation of new versions of all the other documents in the repository. Not good.

How can we address this problem? Here are some options:

A. Disallow the reorganization (effectively requiring that every warning be a separate document)

B. Assign each warning some sort of universal identifier by which it can be addressed regardless of its storage location

C. Somehow automate the creation of the new versions with rewritten links when the target element changes in a way that requires a new pointer.

None of these options is particularly attractive. Option A either turns your repository into a lava flow that can't be changed once constructed or requires you to, as a matter of practice, decompose everything at the lowest level at which you might want to reuse elements, creating a potentially huge collection of small objects, most of which will never in fact be used.

Option B works but requires that you use non-standard addressing mechanisms (because there is no standard-defined space of universal identifiers for elements by which they can be addressed using a standard-defined means, at least for W3C standards). That is, if you use some sort of repository-specific object ID or UUID or whatever, it is, unavoidably, proprietary, and you (or your repository provider) will be on the hook for implementing all the addressing infrastructure needed to work with those IDs. This ties your data to a proprietary system. People do it all the time but that doesn't make it a good thing, especially when it can be easily avoided.

Option C doesn't really solve the problem, it just hides the problem from users and slows the system down. Don't do that.

The problem, of course, cannot be solved using direct element-to-element addressing. It can only be solved by introducing at least one level of indirection.

That is, rather than pointing to the warning directly, we point to something that then, by some mechanism, gets us to the right version of the warning at the right time.

In programming terms this is basic pointer stuff. But in XML linking terms it's a problem because there is no W3C standard that defines any form of indirect address resolution. Think about that.

It does in fact make some sense, because the Web standards are almost entirely focused on information delivery, not authoring. For delivery there is little value in indirection because the information to be delivered is invariant--that is, when you publish one page you can publish all the other pages as well and therefore everything can just point directly to what it needs to--versioning issues don't really apply in the delivery space in the same way they do in the authoring space.

But for authoring we must have indirection. Of course the HyTime standard's addressing stuff is nothing but indirection. It's so indirect you can barely make out what you can actually do. But since HyTime is not a realistic option we need something more Web friendly. To satisfy that requirement I defined the XIndirect specification and submitted it as a Note to the W3C: XML Indirection Facility (I presented a paper on it at Extreme Markup 2003: XIndirect).

XIndirect is the simplest thing that could possibly work. It defines two element types, one of which is just for convenience:

- indirector, which takes an href= attribute that points to the desired ultimate resource target (I should probably update the note to reflect the XInclude-style href=/xpointer= attribute pairs but the function would be the same). An indirector element can have a unique ID. The indirector element has the default semantic of "redirect to my target resource" when it is itself the target of a pointer (of any sort).

- indirectorset, which contains zero or more indirector elements. It's just for convenience, for example, if it's useful to group a bunch of indirector instances together under a common root element or to bind documentation or application-specific metadata to a bunch of indirectors or whatever.

As far as the XIndirect spec is concerned, indirector elements can occur anywhere--it doesn't matter where they are.
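For example (my sketch, not an example from the Note itself--the file names are hypothetical and the id= attribute name is my guess, since the spec just says an indirector can have a unique ID), a small set of indirectors might look like this:

<?xml version="1.0"?>
<indirectorset xmlns="http://www.isogen.com/papers/xindirection.xml">
<indirector id="warn-scissors" href="../common/warnings/dont_run_scissors.xml"/>
<indirector id="warn-ladder" href="../common/warnings/ladder.xml"/>
</indirectorset>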

One important aspect of XIndirect is that it can be used unilaterally with any other linking or addressing scheme--all it requires is that the software that does the address resolution be XIndirect aware so that it knows to resolve the indirections. At the implementation level this is usually expressed as a recursive function that resolves an input pointer and, if that pointer resolves to one or more indirectors, applies itself to those; otherwise it returns whatever it got that wasn't indirectors. A complete implementation also requires cycle detection and hop counting so you can avoid infinite loops or can bail if the resolution is taking too long, but those are frills.
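To make that concrete, here is a minimal sketch of the recursive function in XQuery (to stick with the language of the previous post--the XIndirect note's own sample implementation is XSLT). It does hop counting but omits true cycle detection, ignores fragment identifiers, and assumes doc() can dereference whatever URL forms the indirectors use, including the version-qualified ones shown below, which a real repository would have to arrange:

declare namespace xind = "http://www.isogen.com/papers/xindirection.xml";

declare function local:resolve($targets as node()*, $hops as xs:integer) as node()* {
  if ($hops gt 25) then
    (: bail: probably a cycle or a pathological chain :)
    error(xs:QName("local:too-many-hops"), "Indirection chain too long")
  else
    for $t in $targets
    return
      if ($t/self::xind:indirector) then
        (: an indirector: follow its href and keep resolving :)
        local:resolve(doc(string($t/@href))/*, $hops + 1)
      else
        (: not an indirector: an ultimate target, return it as-is :)
        $t
};

Given the rtd_01.xml document created below, local:resolve(doc("/rtds/rtd_01.xml")/*, 0) would follow the chain and hand back the warning element itself.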

Given the XIndirect spec and the indirector element we can now start building a more-or-less standards-defined, pure-XML indirect address management system.

The key to this system is taking advantage of the SnapCM resource concept to create proxy storage object resources for the element targets of our links. Remember that in our abstract versioning system, only storage objects are versioned. Therefore it follows that to version something you must either make it a storage object or create a storage object proxy for it. The first option is option A above: decompose at whatever level you need so you can version something. The second option is our new option D: use indirection.

OK, let's turn back the clock to time T[0], before we had created doc_01.xml and its XInclude link to the warning. At time T[0] we have the resource dont_run_scissors.xml and its first (and only) version, which is just the single warning element as shown at the start of this post.

We are tasked with creating new resource doc_01.xml and creating its first version. As part of that, we need to XInclude the warning. Here's what we do:

1. We create a new document instance (outside the repository) and create an XInclude link to the warning (inside the repository). We create this link by using some sort of editor customization that lets us pick the link target from a list of available targets in the repository and constructs the link syntax for us. We select the don't-run-with-scissors warning (note we're selecting the warning, not the XML document that contains the warning--our intent as authors is to link to a warning--we don't care where that warning element is stored).

2. The system does the following:

a. It looks in its "where-used" index and sees that the selected warning has not yet been used as the target of any links.

b. It creates a new resource rtd_01.xml and creates the first version of that resource with the following content:
<?xml version="1.0"?>
<indirector xmlns="http://www.isogen.com/papers/xindirection.xml"
href="../common/warnings/dont_run_scissors.xml[v1]"
/>

It commits the first version of rtd_01.xml to the repository.

c. In our doc_01.xml file it creates this XInclude link:
<xi:include href="/rtds/rtd_01.xml"/>


The new resource rtd_01.xml now acts as a proxy for the warning element. Each version of the resource contains a hard pointer to the warning element's location at the time the version is created.

We commit our new doc_01.xml file as the first version of resource doc_01.xml, creating a new snapshot at time T[1].

When we go to process doc_01.xml[v1] we resolve the XInclude link as follows:

1. We resolve the URL "/rtds/rtd_01.xml" to the resource rtd_01.xml. Applying the default resolution policy of "latest visible" we get version rtd_01.xml[v1]. Because there is no fragment identifier we resolve the URL to the root element of the version, which is the indirector element.

2. We see it is an indirector and therefore resolve its href= to its target, which is to specific version v1 of resource dont_run_scissors.xml. Again there is no fragment identifier so we resolve to the root element of the version, which is the warning element.

3. The warning element is returned as the ultimate target of the XInclude link and the normal XInclude semantics are applied to it.

Whew.

Now Jane the Warning Mistress decides to reorganize all the warnings into one file. She does this in an authoring tool that is integrated with our repository such that at the start of the editing session it gets a list of all the elements in the document that are pointed to by any links, whether they be direct links or indirectors. This list allows the editor to do what it can to keep the links consistent as the data is changed in the editor. For example, it might (as a matter of policy) disallow the deletion of any element that is a link target or it might make sure that if a target element is copied that the copy is not confused with the original or if an element is copied from a different resource that it remembers that the copied element was a link target in its previous context.

In this case Jane copies the warning from dont_run_scissors to her new all_warnings.xml document. The editor sees that the warning was the target of an indirector link and remembers that so it can do the right thing at commit time.

When Jane commits her new all_warnings.xml it creates a new resource all_warnings.xml and the first version of it. As part of the commit process, because the authoring tool knows that the copy of the warning in all_warnings.xml was a copy of the original warning, it also creates a new version of rtd_01.xml that reflects the new version of the warning:
<?xml version="1.0"?>
<indirector xmlns="http://www.isogen.com/papers/xindirection.xml"
href="../common/warnings/all_warnings.xml[v1]#xpointer(/*/warning[1])"
/>
This creates a snapshot at time T[2] that now includes the new resource all_warnings.xml and its initial version.

Now, when we go to process document doc_01.xml[v1], resolving the XInclude takes us to the latest version of resource rtd_01.xml, rtd_01[v2], whose indirector in turn points to the warning element in all_warnings[v1].

Note that we did not need to do anything to doc_01.xml for this to work--only the indirection was versioned--all uses of that indirection are unchanged, because they point to the indirection resource and not a specific version.

By the same token, if we process document doc_01[v1] in the context of snapshot T[1] we will resolve the XInclude to the warning in its original location.

This solves the link management problem inherent in doing hard pointing, at least as far as link representation goes. That is, given that you know to create the indirectors, once created, they just work, given XIndirect-aware address resolution where it's needed. Of course, there are still a few challenges here.

First, this does require pretty sophisticated authoring tool functionality for it to be practical--while the indirector elements could be created by hand, no sane person would expect other sane people to do it as a standard practice. However, this level of sophistication is required regardless of how you manage your links and addresses: it's simply a fact that a system that supports linking completely has to be sophisticated and there's no getting around it. You can take some shortcuts if your authors are reasonably savvy and you can impose some reasonable constraints, but a fully-realized link-aware authoring environment is non-trivial. It also depends heavily on the details of how your repository manages links and addresses and indirections and versions and resources, and they all do it differently.

Finally, there are some inherent rhetorical challenges that can come up once you start creating version proxies for elements.

One is that the same target element might be linked to for different purposes (use-by-reference, navigation, semantic association for a specific purpose, whatever). Each of these uses might really want to have its own separate indirector resource that reflects the semantic of the thing used rather than just the initial thing that happened to reflect that semantic (I think this can be usefully cast as a naming problem of the sort that Norm Walsh is currently discussing on his blog). That is, if your authors are sophisticated in their use of links for different semantic purposes they may well need sophisticated indirection support as well. At a minimum, the system has to be prepared to manage multiple indirectors for the same target element.

Another inherent problem is that of notification of linkors. That is, when a new version of an element to which you link is created, you, the owner of the link, need to be informed that the new version exists so that you can decide how to react. If your reaction is simply to use the latest version, you do nothing (assuming your link is using the default "latest visible" resolution policy). If your reaction is to continue to use an older version, then you either have to create a new version of your link, changing the address to point either to a specific version of the indirector or directly to the specific version of the target element (the two are functionally equivalent), or you have to modify the resolution policy associated with the repository-level dependency object that reflects the version-to-resource link. The latter creates a new version of the dependency link but doesn't require a new version of the document that contains the link itself.

In practice these two problems tend to be limited by both keeping the link semantics pretty simple (transclusion and navigation and that's it) and by imposing invariant policies for reacting to new versions that allow the link reaction to be automatic (i.e., always resolve to the latest visible version).

Finally, why did I call my indirector document "rtd_01"? "RTD" stands for "referent tracking document", that is, a document (in the XML sense) that tracks the versions of a "referent"--the target of a reference of any sort.

To sum up:

- For authoring purposes, the basic problem of change ripples cannot be solved except through the use of indirection.

- The indirection provided by repository-level dependency links (version-to-resource links) is necessary but not sufficient when you need to address elements that are not document root elements.

- The XIndirect W3C Technical Note provides the simplest possible indirect addressing syntax for use in a W3C/XML environment.

- By creating RTD documents (element proxies) for elements, we can track the version history of individual elements regardless of where they are stored and regardless of whether or not they are document root elements.

- Tracking the version history of an element requires maintaining knowledge that a given element has a proxy during the editing process so that you can accurately create new proxy versions following a change in location of the element.

- With this mechanism, it doesn't matter how you address an element from an indirector--unique IDs have no particular functional advantage over simple XPaths, for example (see the two sample addresses just after this list). However, IDs might have an advantage if you have to guess at the correspondence between an older version of an element and new versions that are being put into the repository, for example, as the result of an upload of a new version that was edited outside the scope of the repository. But even there IDs can only be a clue.

- The actual data processing involved in resolving indirection is not a big deal (and I include a sample XSLT implementation in the XIndirect note) but there are some things to be careful of, in particular cycles and over-long sequences of indirectors.

- The indirection doesn't need to literally be represented as XML documents (even tiny one-element documents). You could of course do the same thing using relational tables or whatever, and you probably should for scalability and/or performance (although I think that a tool like MarkLogic would probably scale and perform pretty well for resolution of XIndirect indirectors and answering where-used questions. Hmm. Definitely worth experimenting with...).
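For example, these two indirector addresses (the ID value is hypothetical--it assumes the warning carries a unique ID) are functionally equivalent ways to point at the same warning:

href="all_warnings.xml[v1]#xpointer(id('warn-scissors'))"
href="all_warnings.xml[v1]#xpointer(/*/warning[1])"

Within a frozen version the two behave identically; the ID only starts to earn its keep when you have to re-find the element in a new version that was edited outside the repository's view.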

I've discussed all of this in the context of an abstract versioning model (SnapCM) and an abstract repository that manages resources, versions, and dependency links.

You can use this abstract model to help you evaluate commercial or one-off XML CMS systems to see how close they come to being able to manage links completely. You will find that some, such as X-Hive's Docato product, come pretty close. Others do not.

This essentially wraps up my discussion of XML Content Management The Dr. Macro Way--we've taken it all the way through to the management of element versions with sustainable link management using a completely generic and standards-based approach.

About the only thing I haven't covered is the SnapCM notion of "sync", which I think is important but is not necessary per se for doing link management as described here (although it is essential for configuration management of the linked documents).

Maybe that's what I'll talk about next, who knows?


Saturday, September 02, 2006

XIRUSS-T Update: Eclipse Plug-in

Last Thursday I bought a copy of Eclipse: Building Commercial-Quality Plug-ins (2nd Edition) by Eric Clayberg and Dan Rubel with the goal of creating an Eclipse plug-in XIRUSS client. The book is very well written and authoritative and the Eclipse plug-in framework is a remarkable piece of work, both in its overall design for extension and integration and in its execution. It makes creating plug-ins remarkably easy (at least from a getting started standpoint) and the SWT/JFace libraries for user interface components feel more solid and logical than AWT (not that I have any particular basis on which to judge as I've done very little UI development over the years).

Anyway, I had to do some business travel over the weekend (tip: if you're going from Austin, Texas to Norwalk, CT, don't try to drive from Newark Airport--take the train) so I packed the book (which fortunately isn't enormous, just big) and made some progress.

As of this morning I have a very simple repository tree viewer that will reliably navigate the branch-snapshot-version structure of a running repository. The next step will be to make it sufficiently sophisticated to be an actually useful viewer, able to refresh the view, do filtering, and do sorting. Once I get that going I can start adding actions to the tree, such as creating new mutable snapshots and committing them, using drag and drop to organize versions inside organizers, and so on. The next step after that (or possibly before that) is to implement a property page view that can show the properties of repository objects. Once that's in place, I can start working on integrating editing from the repository, which shouldn't be too hard, but at that point I'll be doing deeper integration with the Eclipse framework. If I understand the general Eclipse framework, I should eventually be able to hook the XIRUSS client into the Team infrastructure, at which point any Eclipse-managed resource could be managed in XIRUSS more or less transparently, although I can't imagine that would be trivial.

I don't expect any of this to be hard but there are a lot of moving parts and a lot of details to attend to and things like listeners and event handlers that are somewhat outside my normal pipeline tree-walking data processing programming experience.

At work I've been put onto some high-visibility sales support activities which have the downside that I've less time and energy to spend on XIRUSS but the upside that I'm getting to push pretty hard on current XML content management and indexing tools. I've already reported on MarkLogic and my opinion has, if anything, only improved as I've worked more closely with the software and the folks at MarkLogic. I'm also learning and using XQuery for the first time, which is kind of fun (I simply had no need to use it up until now as XSLT and XPath did what I needed).

I realize that y'all realize that most of this XIRUSS status reporting is for my own benefit and that nobody's waiting breathlessly for me to get this code to a more usable state but I would be interested to know if anybody is either trying the code or otherwise tracking my progress. My main motivation is to get the code to a state such that when I start writing in detail about the versioned linking scenarios there will be running code that demonstrates the management functionality.
