Subscribe to Dr. Macro's XML Rants

NOTE TO TOOL OWNERS: In this blog I will occasionally make statements about products that you will take exception to. My intent is to always be factual and accurate. If I have made a statement that you consider to be incorrect or inaccurate, please bring it to my attention and, once I have verified my error, I will post the appropriate correction.

And before you get too exercised, please read the post, dated 9 Feb 2006, titled "All Tools Suck".

Tuesday, July 25, 2006

XCMTDMW: Re-Use and Not Really Re-Use

In order to fully appreciate the inherent and potential challenges involved in going beyond simple storage object management to XML-aware semantic management, we first have to understand the nature of the semantic structures and relationships, as well as something of their mechanics and plumbing. To that end I want to focus on re-use as an activity and as enabled through semantic information embedded in XML documents. As outlined in the previous post, re-use is both one of the more compelling uses of linking and of XML in general (if not the most compelling) and a special case that, having been satisfied, can provide an implementation base for satisfying any other linking requirements. It is also an activity that drives the need for other sophisticated semantic management features such as full-text search and taxonomic metadata. It also tends to complicate development workflows, potentially driving the need for more sophisticated ways to track and monitor XML documents through their development cycles and over their service lifetimes.

It's also something that almost everyone who does it or who implements systems that support it gets wrong to one degree or another. In my humble opinion. Although I must say, and it should become clearer why, that the emergence of both XInclude and DITA as established standards is helping to change that gloomy picture a bit, but not entirely (for neither XInclude nor DITA get it 100% correct either, although much more in the nature of quibbles than fatal flaws).

What is "re-use"?

In the context of XML-based documentation I define re-use as the use of a given XML element in two or more contexts at the same time. [You could of course re-use individual character strings or attributes, but the same principles apply and it simplifies things to limit ourselves to elements.]

The textbook examples in technical documentation are warnings that need to be presented without change in every manual for a given product type (e.g., the "don't be an idiot" warnings in every mobile phone manual) or the subtask that is included in a number of larger tasks (e.g., the "put away your tools and clean up your work area" subtask familiar to aircraft maintenance personnel).

More interesting examples would be re-use of information that is invariant across a family of products. For example, most sophisticated consumer electronics products, including mobile phones, network access points, DVD players, printers, and so on, are engineered using a common set of subsystems and software that are then combined in different ways or put within different cases to create different products. The documentation for these subsystems can be developed outside the context of any given product and then combined to reflect the specific components, options, and features used in a given product model. This can represent a significant overall savings in documentation development labor costs and reduction in development time, potentially decreasing time to market (if documentation production is currently a gating factor, which it can be, especially if the documentation must be translated into a large number of languages). It is these potential efficiencies and savings that tend to drive the push within larger enterprises to do sophisticated re-use. But even within smaller enterprises with more modest documentation challenges, re-use can have real value.

So how do you do re-use?

There are essentially two ways: use-by-copy and use-by-reference.

Use-by-copy is just what it says: you take an existing piece of information and you literally make a copy of it for use in its new location. This is often called "cut and paste" re-use. There is a place for use-by-copy. For one thing, it requires no special effort to do. On the other hand, as your need to track and manage small bits of information through their life cycles increases (for example, you have the goal of translating a given piece of information exactly once regardless of where it is used), the cost of use-by-copy increases.

Use-by-copy is syntactic re-use in that it operates at the level of the syntax of the data involved. Once you've cut and pasted a piece of data you've changed the syntax of the newly-created data set in a way that cannot, without either local conventions or significant extra work, be tracked or detected. There is nothing inherent in the copied result that tells you it was copied (and not just typed) or where it was copied from.

Use-by-copy has the characteristic that a change to the original data source cannot be easily correlated to any uses of that data in other contexts (because it is hard to track the copies), and a change to the data as copied cannot be easily correlated back to the original source of the copy (although that direction is a little easier, because you can capture information about the copy source at copy time if you choose to implement that functionality). In the general case it is impossible to answer the "where used" question for copied data with 100% accuracy, and typically it can't be answered at all, because features for tracking copies are not built into systems (they're expensive, especially in comparison to other forms of re-use). Because a copy is syntactically indistinguishable from data that was never copied, it is difficult or impossible to apply specific business rules or processing to copied data distinct from the rules applied to data that was not copied.
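
To make the indistinguishability concrete, here is a minimal sketch. The language (Python) and the element names are mine, purely for illustration; this is not any particular system's behavior:

```python
import xml.etree.ElementTree as ET

WARNING = '<warning>Do not short-circuit the battery.</warning>'

# Use-by-copy: the warning is pasted verbatim into two manuals.
doc_a = ET.fromstring('<manual><title>Phone A</title>' + WARNING + '</manual>')
doc_b = ET.fromstring('<manual><title>Phone B</title>' + WARNING + '</manual>')

# Nothing in either parsed tree records that the warning was copied, or
# from where: the copy is indistinguishable from hand-typed markup.
copy_a = ET.tostring(doc_a.find('warning'), encoding='unicode')
copy_b = ET.tostring(doc_b.find('warning'), encoding='unicode')
print(copy_a == copy_b)  # True -- and neither carries any provenance
```

Any "where used" report over these documents has nothing to work with: the copies carry no marker distinguishing them from original content.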

Challenge question: why is a reference to an external parsed entity use by copy and not use by reference?

Use-by-reference is the use of a piece of information via a semantic reference to it, that is, via a hyperlink. XInclude and DITA's conref feature are typical examples of use-by-reference links. Use-by-reference has several important characteristics:

- The data used exists in exactly one place, so there is no difficulty with coordinating different copies of it that may or may not be managed or manageable.

- The fact of the use can be easily seen and managed because the link creates an explicit and manageable connection in the using document between the data used and all of its use contexts. You can always answer the "where-used" question, either because you're indexing the link information as it is created in the repository or by brute force.

- The uses can be constrained and managed because the links themselves are first-class objects that can have their own element types, descriptive metadata, and precise business rules associated with them. For example, you can easily define a rule that says "in this structural context you are only allowed to re-use elements of type A, B, or C" and easily enforce that rule in an authoring tool or validation application.

- The processing and rendition implications of a given use are defined and implemented by the semantic processor, not the data parser. This means that you have complete flexibility over what is or should be allowed in terms of what can refer to what where and complete flexibility over what the data processing result applied to one of these references is for a given process.

- There is no requirement that the organization of the data for storage (as XML documents managed as storage objects) be the same as its organization for re-use. That is, use-by-reference links are element-to-element relationships, not element-to-storage object or storage-object-to-storage-object. This means that, in the abstract, it doesn't matter what storage object a given element is in, the mechanism by which you use it and the processing result of having used it are the same. This is very important.
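
These characteristics can be sketched in a few lines. This is illustrative Python; the "use" element, its "ref" attribute, and the component IDs are hypothetical conventions of my own, not any standard:

```python
import xml.etree.ElementTree as ET

# Shared component library: each reusable element exists exactly once.
library = {
    "warn.battery": ET.fromstring('<warning>Do not short-circuit the battery.</warning>'),
}

# Two using documents point at the component via an explicit link element.
docs = {
    "manual-a": ET.fromstring('<manual><use ref="warn.battery"/></manual>'),
    "manual-b": ET.fromstring('<manual><use ref="warn.battery"/></manual>'),
}

def where_used(component_id):
    """Answer the 'where-used' question by indexing the explicit links."""
    return sorted(name for name, doc in docs.items()
                  if any(u.get("ref") == component_id for u in doc.iter("use")))

print(where_used("warn.battery"))  # ['manual-a', 'manual-b']
```

Because the links are explicit, first-class data, the where-used question is a simple index lookup (or, at worst, a brute-force scan), and the referenced element still exists in exactly one place.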

There's a lot to say here, but I want to take a moment and present my general critiques of existing use-by-reference mechanisms so they can be in your mind going forward and to presage discussion topics that I will get to by and by.

- Too constraining. For example, DITA's conref as defined in the current DITA specification requires that the result of resolving a conref be schema valid. I understand, I think, why this requirement is imposed, but I think it is unnecessarily constraining and I summarily reject it out of hand. There, I said it.

- Too generic for authoring. For authoring you almost always need to be able to say what is meaningful to use in what context (this is one reason DITA imposes the constraints it does). The easiest and most obvious way to do this is through distinct element types and content models. For example, if I want to re-use sections it makes sense that anywhere "section" is allowed I would also allow an element that can be used to link to sections for re-use. If the element type is generic (e.g., xi:include), no can do, because xi:include has no way to express context-specific reference constraints. However, if you had a specialized form of xi:include, say "SectionUseByRef", you could both constrain where it is allowed and, with some simple attribute conventions (such as those defined by the old HyTime reftype facility), say clearly what is or isn't allowed to be referenced, with whatever degree of constraint is appropriate for you. For this reason I assert that XInclude, while otherwise very useful (modulo a quibble about IDs), is not directly appropriate for authoring because it provides no means of specialization or reference constraint. But fortunately that's easily remedied. [See a paper I gave on this subject at XML Europe 2004.]

- Weak or broken addressing schemes. I like XInclude's addressing convention (href= + xpointer=) and now use it for everything I can. Other addressing schemes are either limited, non-standard, proprietary, or too idiosyncratic. The most obvious case is DITA, which defines its own addressing syntax and semantics that cannot be easily rationalized with other existing addressing standards. This is probably my biggest beef with DITA as a specification (I have no real beef with DITA conceptually or as a general body of practice--only with the details of its current specification).

- They're not really use-by-reference. This brings us back to the challenge question.

Why is a reference to an external parsed entity use by copy and not use by reference?

The answer is: because it's syntactic and not semantic. An external parsed entity reference is a syntactic reference that is processed and resolved by the XML parser and that is, in general, invisible to any downstream process. For example, XSLT provides no standard way to operate on entity references (internal or external). This is as it should be, because the use or non-use of entities has absolutely no effect on the semantic content of the XML document: the element structure is the same, the text content is the same, and so on. In addition, an entity reference is not an object in any useful sense: it cannot be typed, it cannot hold any metadata directly (in the way that XML elements can), and there is no standard way to impose constraints on its use.

In short, for the purposes of processing XML documents (as opposed to editing them), there is no useful (or, with XSLT, even detectable) difference between having used some data by literally copying it into your starting XML file and using that same data via an external parsed entity reference.
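
You can demonstrate this equivalence directly. The sketch below (Python, with the parser's entity substitution simulated as literal text replacement, which is in effect what it is) shows that what comes out of the parse is identical either way:

```python
import xml.etree.ElementTree as ET

ENTZ = '<foo><bar id="x">This is bar within foo</bar></foo>'

# What the parser does with an external parsed entity reference is, in
# effect, textual substitution before anything downstream ever runs:
via_entity = '<doc>' + ENTZ + '</doc>'
typed_inline = '<doc><foo><bar id="x">This is bar within foo</bar></foo></doc>'

# After parsing, the two are character-for-character the same canonical XML:
print(ET.canonicalize(via_entity) == ET.canonicalize(typed_inline))  # True
```

An XSLT engine, or any other post-parse process, sees exactly the same infoset in both cases; the fact that an entity was involved is gone.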

Have I made my point?

In short, external parsed entities are not, themselves, objects that have a reliable independent existence. This was especially true in SGML, where all documents had to validate against a DTD and entities did not have to align to element boundaries. With XML we at least removed the requirement for DTD validity and require entities to align with element boundaries, but we don't, for example, require entities to consist of exactly one root element. That should be a warning.
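
A quick way to see this (a Python sketch; the helper function is mine) is that entity content with two sibling root elements is perfectly legal as entity content but cannot stand alone as a document:

```python
import xml.etree.ElementTree as ET

# Legal as external parsed entity content, but two sibling roots:
entity_text = '<a/><b/>'

def is_document(text):
    """Can this text stand alone as a well-formed XML document?"""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

print(is_document(entity_text))                      # False on its own
print(is_document('<doc>' + entity_text + '</doc>')) # True once substituted
```

The entity only "works" in the context of some referencing document, which is exactly why it is not an object with a reliable independent existence.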

Here's the key test to determine whether something is doing use-by-reference or use-by-copy, a test that should be applied to every XML-aware content management system you are considering using:

Create two documents, DocA and DocB. For both documents, define a schema that declares an element with an ID-type attribute (that is, an attribute of type xs:ID). Now create a new file EntZ.ent that looks like this:

<!-- start of external parsed entity -->
<foo><bar id="x">This is bar within foo</bar></foo>
<!-- end of external parsed entity -->

Now in DocA declare the entity [Oops, using a schema and not a DTD? Cool--you can't use entities at all so this isn't even an issue. Proceed directly to Free Parking] and reference it:
<!DOCTYPE doc SYSTEM "doc.dtd" [
<!ENTITY EntZ SYSTEM "EntZ.ent">
]>
<doc>&EntZ;</doc>

Now validate DocA. What happens? The first problem is that the element type "foo" may or may not have been declared in "doc.dtd". Since the XML fragment EntZ.ent has no namespace or schema associated with it, it is absolutely impossible to determine what abstract type it is given the information at hand. Of course in practice you know what document types your documents use and you would never create or be given an entity from a different document. Nope, would never happen. Never. Under any circumstances. Unless, of course, you're trying to use information managed at the corporate level that could have been created by anyone within your multi-national giant of an enterprise at any time over the last 20 or 30 years (I'm thinking into the future here, but there are certainly enterprises, IBM comes to mind, with existing 20-year bodies of SGML- and XML-based information that is very likely still useful in current product manuals).

OK, assuming the element type "foo" is declared in "doc.dtd" and that its content model is consistent with the foo instance in EntZ.ent, what happens when you validate the document using your XML parser?

You get a "document is valid" message because the document is syntactically valid per the DTD.

Now create DocB that looks like this:
<!DOCTYPE doc SYSTEM "doc.dtd" [
<!ENTITY EntZ SYSTEM "EntZ.ent">
]>
<doc id="x">&EntZ;</doc>
Note that it is using the same entity, EntZ.ent.

Now validate DocB with your XML parser. What happens?

You get an "XML Error: duplicate ID 'x'". Doh!
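
The failure mode is easy to reproduce even without a DTD-validating parser. In this Python sketch, the duplicate_ids helper is mine, simulating just the duplicate-ID check that a validating parser performs, with the entity "resolved" as the textual copy it really is:

```python
import xml.etree.ElementTree as ET

ENTZ = '<foo><bar id="x">This is bar within foo</bar></foo>'

def duplicate_ids(xml_text):
    """Return ID values occurring more than once: what a validating
    parser reports as 'duplicate ID'."""
    seen, dups = set(), set()
    for el in ET.fromstring(xml_text).iter():
        i = el.get('id')
        if i is not None:
            (dups if i in seen else seen).add(i)
    return sorted(dups)

doc_a = '<doc>' + ENTZ + '</doc>'           # DocA: entity resolved by copy
doc_b = '<doc id="x">' + ENTZ + '</doc>'    # DocB: root already uses ID "x"

print(duplicate_ids(doc_a))  # []    -- DocA is fine
print(duplicate_ids(doc_b))  # ['x'] -- the clash
```

The clash arises precisely because, by the time validation happens, the entity content has been merged by copy into DocB's single flat ID namespace.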

What do you do? What can you do?

- You can change document DocB so it uses a different ID value.

- You can change EntZ so it uses a different ID value.

- You can make a copy of EntZ and then change its ID.

All of these options should make it clear that you are doing use-by-copy. What if there are already existing references by ID to the element with ID "x" in DocB? You really don't want to change that ID; you'll break all those links. You don't want to change the ID in EntZ--what if DocA, which used it first, has existing links to the bar element by ID "x"? And of course, if you copy EntZ, which is your only remaining choice, IT'S A COPY! How much clearer could it be?

How is this a test for a repository? One test is, create the above system of documents, invalid as they are, and try to import them. It should fail. If it doesn't, you've probably got a problem.

Here's another test for repositories that claim to provide direct support for re-using arbitrary elements. Create DocA and DocC, which includes the foo element with the bar element with the ID "x". Now, using the repository's features for creating a re-use reference, include the element foo from DocC into DocA twice. What happens? If the repository allows you to create the second reference then it is implicitly claiming to do use-by-reference and not use-by-copy because a use-by-copy would create an invalid document and no repository would be stupid enough to allow that under its direct control, would it?

Now export DocA to the file system. What happens? How is the "bar" element used by DocA represented in the exported result? A typical result is that the repository synthesizes a new instance for export in which the used elements are copied into the new instance at the point of reference. This is obviously bad. Another typical result is that the repository synthesizes a new external parsed entity containing just the "bar" element. But this is obviously no better, because it's still a copy and the resulting DocA is still invalid.

The correct possible answers include:

1. DocA and DocC are both exported and the reference from DocA to the "bar" element is exported as a link that can be resolved semantically in the context of the documents as exported.

2. DocA is exported and the reference from DocA to the "bar" element is exported using an address that can be directly resolved against the repository, where DocC exists in its authoritative form.

3. The export operation is not allowed because the result, expressed as entity references, would produce an invalid document.

Of these options 1 and 2 are preferable because it means the repository really is doing use by reference and not use by copy. Option 3 is the least desirable but at least it is useful and honest. Allowing export of documents that are not even syntactically valid XML (much less schema-valid) that the repository created is unforgivable. It's one thing for a repository to allow the storage or creation by authors of invalid documents--that's a policy decision made by the users of the repository. But for a repository to create, on its own, invalid documents is just a broken system that, at least in this case, reflects a fundamental architectural flaw.

At this point I would like to point out that I've been applying this test to SGML- and XML-aware content management systems for as long as such things have been commercially available. I distinctly remember being at one of the early but big SGML conferences, probably SGML '94 or '95, going up to the booth for one of, at the time, the biggest and flashiest SGML-aware content management systems and asking the poor person manning the booth: "If I include a paragraph with the ID 'x' in a document twice, what happens?" And her answer was "it allows it". And how is it exported? "As a single instance." At which point I said "thank you" and put that product on my "do not use" list. It is still on it. And in the meantime many other tools have been developed and successfully marketed that also fail this test. It both enrages and saddens me.

Another way to think about this is do you have a documents management system or a document management system?

If a system manages the XML data such that all the data that can participate in re-use must share a flat ID namespace, then it is, by the definition of XML syntax, able to manage at most one syntactic document (irrespective of how that document is organized into storage objects). This is even more obvious if all the XML data must conform to the same schema. [By contrast, systems that support only a single schema but maintain syntactic document boundaries in the repository (that is, that allow multiple elements with the same ID, addressable in the context of their containing documents) are documents management systems, although somewhat limited ones.]

If, on the other hand, it doesn't impose a flat ID namespace (because it does true use by reference) then it is capable of managing zero or more documents, which is usually what you want.

Another way to express this difference is: having imported a set of syntactically-distinct XML documents into a repository, can I address and export those documents as documents, in terms of how they were known before import? If not, then you have a document management system. If yes, then you have a documents management system. If after import the elements are just part of an otherwise undifferentiated pool of elements, then they are, by the definition of XML syntax, a single syntactic document, indistinguishable from taking the same elements, syntactically concatenating them into a single file, and wrapping them in a single root element.
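
The distinction can be sketched with a toy in-memory store (Python; the names and data structures are purely illustrative):

```python
# A *documents* management system keeps document boundaries: IDs need be
# unique only within each document, and addresses are (document, id) pairs.
documents_store = {
    "DocA": {"x": "<bar> from DocA"},
    "DocB": {"x": "<bar> from DocB"},
}

def resolve(doc_name, elem_id):
    """Address an element in the context of its containing document."""
    return documents_store[doc_name][elem_id]

# A *document* management system flattens everything into one ID space --
# by XML's rules that is at most one syntactic document, so the second
# import of id="x" must clash:
flat_store, clashes = {}, []
for doc_name, elements in documents_store.items():
    for elem_id, element in elements.items():
        if elem_id in flat_store:
            clashes.append(elem_id)  # duplicate ID on import
        flat_store[elem_id] = element

print(resolve("DocA", "x"))  # resolvable with a document-scoped address
print(resolve("DocB", "x"))  # so is this one, despite the same ID
print(clashes)               # ['x'] -- the flat namespace can't hold both
```

With document-scoped addresses both DocA and DocB survive import intact; with the flat namespace, one of them has to be mangled or rejected.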

Of course, you will say, there are many enterprises that have operated well for years using document and not documents management systems. This is true, and I salute them, and I feel their pain, because while they may have made it work, it cost them much more than it needed to in terms of money, pain, and imposed limitations on what they could do with their XML.

For example, I have a former client for which we developed a sophisticated schema for managing complex information about just the sort of modular component-based products mentioned above. We were explicitly not involved in developing their content management systems (we were only doing schema design, authoring support, and print and HTML rendition). We had gotten the schema well established, they were making good progress with implementing their XML-based authoring and production systems, and we were just about to start expanding the schema to be more of an architecture that could support a wider variety of product types within the enterprise, when we were told "um, our repository requires that we limit schema changes as much as possible, so you'll have to not make this more specialized (or even enable future local specialization) but instead make the schema more generic and less constraining, so we don't have to modify it in the future, because doing so would require lots of effort to update the repository (the repository was tightly bound to the schema)." Doh! This was predicted (by me) but nobody ever listens to me. This company continues to do work and do it well, but my considered opinion is that they have made things much harder for themselves by their choice of repository and have severely limited their ability to directly serve the needs of their authors and product groups by limiting their ability to refine and specialize their XML designs.

This particular view might well reflect my prejudice that the data is more important than how it happens to be managed at a given moment in time, but I also acknowledge that this enterprise was (and is) trying to solve some very challenging problems of managing localized documents and component re-use at pretty large scales, and that they were doing what, in their considered opinion, was the right thing to do. They were not making an uninformed choice, and it may well be that, at that moment in time, the approach they took was the most effective way to solve the problem and that the cost in lost flexibility was an acceptable cost. And it's not like it's an irrevocable decision--they can always choose to change their repository approach to one that is more flexible. But unless the system they've put in place fails utterly, which seems unlikely, it also seems unlikely that they'll abandon their significant investment in that system any time soon.

To sum up:

- Re-use is the use of a given XML element in multiple contexts

- Re-use can either be by copy or by reference

- Use by copy is cheap to do but hard to manage

- Use by reference takes more effort (you've got to create a link and then resolve it when you process the data) but is directly manageable

- Some technologies that may not appear to be doing use by copy at first glance are in fact doing use by copy. This includes external parsed entities and repositories that copy data on export in order to reflect re-use references.

- There is an easy test, the "IDed element used twice" test, that will quickly reveal whether a given system is doing use by reference or use by copy.

In all this discussion there are some issues that I've alluded to or glossed over that I think we are now ready to start taking on head on. In particular, there is the fundamental issue of "addressing", that is, the mechanism by which you create and resolve pointers from one thing to another. All links use some form of addressing. If use-by-reference is linking then it follows that use-by-reference involves addressing.

The management and processing of addresses is one of the key bits of functionality involved in XML semantic processing and is a major source of the boundary complexity inherent in the task of getting XML data into and out of XML-aware repositories. At this point in the discussion you should at least be starting to get a hunch that there are some challenging processing issues lurking here and if you're really getting this (or you've been reading ahead) then you already have a pretty good idea of what those challenges are and how you might go about addressing them in a working system.

For next time: Almost certainly the first deep dive into boundary complexity, focusing on what you have to do to support use-by-reference



Blogger John Cowan said...

I think you're making an important distinction here, but I'm really not happy with the terminology "use by copy" vs. "use by reference". Entity references have some, but not all, of the characteristics you specify for use by reference; in particular, the first one.

Similarly, XInclude when processed in the standard way does not really meet your definition of use by reference: an XInclude processor makes the including element disappear, so it is no longer there in the result, and the relationship expressed does involve a storage object on the link target side, though it's possible to include just one element of that target.

XLink is much closer to true use by reference; XInclude was specifically separated from XLink in order to provide something more low-level, though not as low-level as entity references.

Only marginally relevant to this posting, but perhaps useful in spurring some ideas:

Harking back to David Durand's Palimpsest model of changes, he has four kinds of changes: the standard insert and delete, but also two others, dynamic copy and dynamic move. (Static copy is just a special case of insert, and static move is a special case (done atomically) of insert and delete, so they are not in the model).

If a paragraph is dynamically copied, for example, and then an insert change is processed against the text in its original location, the insert is done in both locations. In the case of dynamic move, the insert is of course only done at the destination of the move. See his thesis for lots more detail about how move vs. move and other interesting kinds of conflicts are resolved.

4:04 PM  
Blogger Eliot Kimber said...

You are correct that external parsed entities do exist in exactly one place, but that doesn't really change the fact that they are, for all practical purposes, a form of copy, for all the reasons I give.

Of course you can impose local practices and constraints that will allow you to treat external parsed entities like objects but that doesn't make them objects.

Also, your analysis of XInclude as being like entity references as opposed to XLink is, I think, confusing one possible processing result (a new instance with the referenced data transcluded) with the core semantic, which is simply use-by-reference.

That is, there is no processing requirement that XIncludes be processed to create new literal instances and no obligation by any processor that they ever be resolved. For example, for the purposes of printing a compound document you might resolve the XIncludes in memory using the XSLT document() function. This does not create a new instance, it just makes the content of the target document available for processing.

If you hide your XInclude processing in your "parser", as is the case for Xerces and other XInclude-aware processors then the practical effect is pretty much like entities. But I normally recommend *against* that approach because you've lost the option of not resolving some or all of the references or applying your own policies and processes to them. For example, for one client I had to create a style sheet that maintained all the information about all the original XML input files in the printed result, sort of a "debug report" for the compound document. No way to do that if the XIncludes are already resolved by the time the data gets to the XSLT engine. And of course since I seldom, if ever, use XInclude directly for authoring it's not even an option as my use-by-reference links are not true standard XIncludes (even though they are semantically consistent with the XInclude spec).

Also, I don't agree with the (at least implied) ID handling rules in XInclude so again I can't use built-in XInclude processing. But more about that later (I did discuss this in my XML Europe paper on XInclude for authoring).

I'm not sure what you mean by "relationship expressed does involve a storage object on the link target side". All links, regardless of form of address or semantic, involve one or more storage objects on the target side, because everything is stored in a storage object one way or another. For XML it is always the case: by definition XML elements are stored in storage objects, namely XML entities (including the document entity). So I don't see how there's a useful distinction between XInclude and XLink in this case. The only fundamental difference between the two is syntax (of the link elements--the addressing syntax is the same for both), not semantics.

4:48 PM  
