Dr. Macro's XML Rants: XCMTDMW: Import is Everything

Wednesday, July 26, 2006

XCMTDMW: Import is Everything

We've talked a lot about what an XML-aware CMS should look like and what it needs to do. Now it's time to put something into it.

So first a little map of the area we're about to explore. Where we are is a border region, the boundary between where your XML documents are now, the "outside world", and where we want them to be, the "repository". Separating these two is a high ridge of mountains that can only be crossed with the aid of experienced guides and, depending on the cargo you're carrying, more or less sophisticated transport. [Or, on a bad day, some sort of demilitarized zone fraught with hidden dangers and minefields on all sides.]

If you're just bringing in files containing simple or opaque data with little useful internal structure or references to other files, a simple mule train will do the job. But if you're bringing in interconnected systems of files containing sophisticated data structures you're going to need the full logistical muscle of a FedEx or UPS, who can offer a range of services as part of their larger transportation operation.

The point I guess I'm trying to make is that as soon as you go from files that are individual isolated islands of data to files that connect to each other in important ways, you're going from simple to dangerously complex.

Most, if not all, data formats used for technical documentation use or can use interconnected files to create sophisticated systems of files. The most obvious case is documents that use graphics by reference or point to style sheets or that have navigation links to another document. Even PDFs, which we tend to think of as atomic units of document delivery, can have navigable links to other PDFs (or to anything else you can point a URI at).

So any repository import mechanism needs to be able to work with systems of files as systems of files, however those systems might be expressed in the data. Even if you aren't doing any semantic management but only storage object management, it is still useful, for example, to be able to import all of the files involved in a single publication as a single atomic action.

I want to stress here that while XML as a data format standardizes and enables a number of ways to create systems of files, it is not in any way unique in creating systems of files to represent publications.

This suggests that a generalized content management system must have generic features for both representing the connections between files and using and capturing those connections on import. We've already established that the storage management layer (Layer 1 in my three-layer model) should provide a generic storage-object-to-storage-object dependency facility. It follows that our import facilities should provide some sort of generic dependency handling facility.

At this point I want to define a few terms that I will use in the rest of this discussion:

- publication. A single unit of publishing, as distinct from the myriad data objects that make up the publication. This would normally translate to "doc number" or "title" but in any case it is the smallest unit of data that is published as an atomic unit for delivery and consumption. It is usually the largest or next to largest unit of management in a publication workflow in that you're normally managing the creation of publications for the purpose of publishing them atomically at specific times. That is, while some information is published piecemeal as topics that are dynamically organized, the typical case is that you're publishing books on paper or as single PDFs. That book or that PDF is the "publication". Thus a "publication" is a business object that can be clearly distinguished from all other publications, e.g., by its ISBN or doc catalog part number or whatever. While it is not required, it is often the case that publications are represented physically by the top-level or "master" file of their source data (in DITA terms, by a map or bookmap).

- compound document. A system of storage objects with a single root storage object linked together using some form of semantic link (e.g., XIncludes, topicrefs, conrefs, or whatever) in order to establish the direct content of a publication or similar unit of information organization or delivery. What exactly constitutes the members of a compound document is a matter of specific policy and document type semantics. For example, if you have both XIncludes and navigation links among several XML documents you would normally consider only the XIncludes for the purpose of defining the members of the compound document.

- resource. An object that represents and provides access to all the versions of a single logical object. For example, if you import a file for the first time, that creates both a resource and a version, which points to the resource. The resource represents the file as a set of versions. If you then import a second version of the same file, it would point to the first version, from which you could then navigate to the resource. Resources are objects with unique identifiers within the repository. From a resource you can get to any of its versions. Therefore the resource acts as a representation of the file independent of any of its versions. Resources are vitally important because they are the targets of dependency relationships held in the storage management layer.

- version. An invariant collection of metadata and, optionally, data, related to exactly one resource and to zero or more previous or next versions of the same resource. When you import files into the repository you are creating new versions. Once created, versions do not change (you could, for example, implement your repository using write-once storage). The only possible exception to their invariance is version destruction--there are some use cases where it is necessary to be able to physically and irrevocably destroy versions (for example, document destruction rules for nuclear power plants in the U.S. or removal of draft bills from a legislative bill drafting system).

- repository. A bounded system that manages a set of resources, their versions, and the dependencies between versions and resources.

- storage object. A version that contains data as a set of bytes. A storage object provides methods for accessing its bytes.

- dependency. A typed relationship between a specific version and a resource reflecting a dependency of some sort between the version and the resource. The pointer to the resource includes a "resolution policy" which defines how to choose a specific version or versions of the resource. The default policy is "latest". Therefore, by default, a version-to-resource dependency is a link between a version and the latest visible version of the target resource. Dependency policies can also specify specific versions or more complex rules, such as rules that examine metadata values, storage object content, the phases of the moon, the user's heart rate, or whatever.

All of these terms except "publication" are from the SnapCM model http://www.innodata-isogen.com/knowledge_center/white_papers/snap_cm.pdf.

- bounded object set (BOS). The set of storage objects that are mutually dependent on each other, directly or indirectly. A compound document reflecting only XInclude links would form one BOS. If you also reflected any cross-storage object navigation links you would get a different (larger) BOS. BOSes are useful for defining units of import and export as atomic actions. A BOS is "bounded" in that it is finite. When constructing a BOS that includes navigation links you may need to define rules that let you stop including things at some point, otherwise you might attempt to include the entire Web in your BOS, which is, for all practical purposes, unbounded. It is a set in that a given storage object occurs exactly once in the BOS, regardless of how many times it might be linked from various BOS members. The creation of a BOS requires that you be able to determine the identity of storage objects in some way, distinct from the mechanism by which they were addressed. That is, given two different URIs that you can resolve, you need to be able to determine that the resulting resources (in HTTP terms) are in fact the same resource. All file systems should provide this ability but not all storage systems can do this.
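To make the terms above a little more concrete, here is a minimal Python sketch of resources, versions, and dependencies with a resolution policy. It is not SnapCM or XIRUSS-T code: the class names, the method names, and the choice to treat any policy other than "latest" as an explicit version ID are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Dependency:
    """A typed link from a version to a resource, plus a resolution policy."""
    dep_type: str
    target_resource_id: str
    policy: str = "latest"


@dataclass(frozen=True)
class Version:
    """An invariant snapshot that belongs to exactly one resource."""
    version_id: str
    resource_id: str
    data: Optional[bytes] = None                  # set only for storage objects
    previous_version_id: Optional[str] = None
    dependencies: Tuple[Dependency, ...] = ()


@dataclass
class Resource:
    """Stands for a logical object as the ordered list of its versions."""
    resource_id: str
    version_ids: List[str] = field(default_factory=list)


class Repository:
    """A bounded system of resources, versions, and dependencies."""

    def __init__(self) -> None:
        self.resources: Dict[str, Resource] = {}
        self.versions: Dict[str, Version] = {}

    def add_version(self, resource_id, version_id, data=None, dependencies=()):
        resource = self.resources.setdefault(resource_id, Resource(resource_id))
        previous = resource.version_ids[-1] if resource.version_ids else None
        version = Version(version_id, resource_id, data, previous, tuple(dependencies))
        resource.version_ids.append(version_id)
        self.versions[version_id] = version
        return version

    def resolve(self, dependency: Dependency) -> Version:
        resource = self.resources[dependency.target_resource_id]
        if dependency.policy == "latest":
            return self.versions[resource.version_ids[-1]]
        # Assumption: any other policy value names a specific version ID.
        return self.versions[dependency.policy]
```

Because the dependency targets the resource rather than a version, adding a newer version of the target changes what a "latest" dependency resolves to without touching the dependent version.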

This is almost all there is to the SnapCM model. There's a bit more that I'll introduce as we need it. I should also point out that you should be able to map the SnapCM abstract model more or less directly to any existing versioning system. For example, with Subversion, there is a very direct correspondence between the SnapCM version, resource, and repository objects and Subversion constructs.

Therefore SnapCM can be valuable simply as a way to think about the basic operations and characteristics of systems of versions separated from distracting details of implementation. That thinking can then be applied to specific implementation approaches or existing systems. For example, you might have some crufty old content management system built up over the years with lots of specialized features, no clear code organization or component boundaries, and so on. By mapping what that system does to the SnapCM model you might be able to get a clearer picture of what your system does in order, for example, to separate, if only in your mind, those features that are really core content management features and those features that are business-object and business-logic specific (import, export, metadata reporting, UIs, etc.).

For the rest of this discussion I will only talk about XML compound documents, because that's our primary focus and they are clear. But I want to stress that the basic challenges of import apply to any form of non-trivial documentation representation, proprietary or standard, and the basic solutions are the same. A system built to handle XML compound documents should be able to be quickly adapted to managing FrameMaker documents just by adding a bit of Frame-specific import functionality. Note my stress on the quickly.

Let's start small, just a single XML document instance governed by an XSD schema. Let us call it "doc_01.xml". We want to import it into the repository. This is the simplest possible case for our purposes as we can assume that you will not be authoring documentation for which you do not have a governing schema. There are other XML use cases in which schemas are not needed or are not relevant. This is not one of them.

So right away we have a system of at least two documents: the XML document instance and the XSD document that governs it. To import this system of documents I have to do the following:

1. Process the XML document semantically in order to discover any relationships it expresses via links in order to determine the members of the bounded object set we need to import. We have to import at least the minimum required BOS so that the state of the repository after import, with respect to the semantics of the links involved in the imported data, is internally consistent. That is, if DocA has a dependency on DocB that if not resolved prevents correct processing of DocA, then if you only import DocA and not DocB, the internal state of the repository will be inconsistent. Therefore you must import DocA and DocB as an atomic action in order to ensure repository consistency.

In this case we discover that doc_01.xml uses an xsi:schemaLocation= attribute to point to "book.xsd". This establishes a dependency from doc_01.xml to book.xsd of the type "governed by" (the inverse relationship, "governs", while interesting, is not a dependency because a schema is not dependent on the documents it governs).

We don't find any other relevant links in doc_01.xml.

At this point, we have established that doc_01.xml is the root storage object of our compound document and the first member of the BOS to be imported. We know that book.xsd is rooted (for this compound document) at doc_01.xml and will be the second member of our BOS.

2. Process the compound document children of the root storage object, i.e., book.xsd. We determine that book.xsd has no import or include relationships to any other XSD documents (if it did we would of course add them to our BOS).

At this point we have established a BOS of two members reflecting a compound document of two storage objects.
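Steps 1 and 2 amount to a transitive walk over links with de-duplication by storage-object identity. A rough Python sketch follows, assuming file-system input and a caller-supplied `find_links` function (a hypothetical name) that returns the relative pointers found in one file. Identity is approximated here with `os.path.realpath`, which is one possible choice, not the only one.

```python
import os
from typing import Callable, Iterable, List


def build_bos(root: str, find_links: Callable[[str], Iterable[str]]) -> List[str]:
    """Return the members reachable from root, each listed exactly once."""
    seen = set()
    members = []
    pending = [root]
    while pending:
        # Normalize so that two different addresses of one file compare equal.
        path = os.path.realpath(pending.pop())
        if path in seen:
            continue
        seen.add(path)
        members.append(path)
        base = os.path.dirname(path)
        # Pointers are resolved relative to the file that contains them.
        pending.extend(os.path.join(base, target) for target in find_links(path))
    return members
```

What `find_links` recognizes is where the policy lives: an implementation that returns only schema and XInclude pointers yields the compound-document BOS, while one that also returns navigation links yields the larger BOS and needs its own stopping rule.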

3. For each member of the BOS, determine whether or not the repository already has a resource for which the BOS member should be a new version.

Hold the phone! How can I possibly know, in the general case, whether a given file is already represented in the repository?

The answer is: you can't. There is no general way to get this knowledge. There are a thousand ways you could do it.

One approach would be to use a CVS- or Subversion- style convention of creating local metadata (the "working copy") that correlates files on the file system to resources and versions in the repository. This is a perfectly good approach.

Another approach would be to use some sort of data matching heuristic to see if there are any versions in the repository that are a close match to what you're trying to import. There are systems that do something like this (I know some element-decomposition systems will normalize out elements with identical attributes and PCDATA content).

You can use filenames and organization to assert or imply correspondence (if a file with name X is in directory Y on the file system and in the repository then they're probably versions of the same resource). Of course this presumes that the repository's organizational facilities include something like directories. Not all do.

Another approach is to require the user to figure it out and tell the importer.

This last approach is the only really generalizable solution but it's not automatic. In the XIRUSS-T system I've generalized this in the import framework through the generic "storage-object-to-version map", which defines an explicit mapping between storage objects to be imported and the versions of which they are to be the next version, if any. How this map gets created is still use-case-specific. It could be via an automatic process using CVS-like local metadata, it could be heuristic, it could be via a user interface that the importing human has to fill out. But regardless you have to have some way to say explicitly at import time what existing versions the things you are importing are related to.
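As an illustration only (this is not the XIRUSS-T API), a storage-object-to-version map can be as simple as a dictionary. The sketch below combines a name-matching heuristic with explicit user overrides. The function name, its parameters, and the rule that an override of None means "treat as new" are all assumptions.

```python
from typing import Dict, Iterable, Optional


def build_version_map(
    bos_members: Iterable[str],
    latest_by_name: Dict[str, str],
    overrides: Optional[Dict[str, Optional[str]]] = None,
) -> Dict[str, str]:
    """Map each storage object to the existing version it should succeed.

    Members with no entry in the result get a new resource on import.
    """
    overrides = overrides or {}
    version_map = {}
    for member in bos_members:
        # An explicit override wins; otherwise fall back to matching by name.
        previous = overrides.get(member, latest_by_name.get(member))
        if previous is not None:
            version_map[member] = previous
    return version_map
```

With an empty repository there is nothing to match against, so the map comes back empty and every member becomes a new resource.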

OK, for this first import scenario we establish that in fact the repository is empty so there's no question that we will be creating new resources and versions for both doc_01.xml and book.xsd.

4. Having constructed our empty storage-object-to-version map, we execute the import process, the result of which is that we create two new resources, one for doc_01.xml and one for book.xsd, and for each resource, the corresponding version, being storage objects holding the sequence of bytes from doc_01.xml and book.xsd respectively. We also create a dependency instance from the version doc_01.xml (let us call this doc_01.xml version 1) to the resource for book.xsd.

The creation of these objects in the repository is an atomic transaction such that, as far as the repository is concerned, the resources, versions, and dependencies all came into existence at the same moment in time. This is very important--if the import activity is not atomic then it cannot be easily rolled back and the repository will likely be in an incomplete, inconsistent state for some period of time. This is an important difference between CVS and Subversion, for example. CVS does not have any reliable form of atomic commit of multiple files while Subversion does. Any repository that cannot do atomic commits as a single transaction that can be rolled back is seriously limited and should be given a very close look. I don't know if it's been corrected in the meantime, but in 1999, when we were using Documentum to store documents for a bill drafting system, we discovered that Documentum could not do atomic commits as single transactions. This was very distressing to us.
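The all-or-nothing behavior can be sketched with a toy in-memory repository that stages every new object on a copy and only swaps the copy in once a consistency check passes. The class, the record layout, and the `targets_exist` check are hypothetical. A real repository would rely on its storage engine's transactions instead.

```python
class InMemoryRepository:
    def __init__(self):
        self.objects = {}   # object ID -> record (a plain dict in this sketch)

    def import_atomically(self, new_objects, check_consistency):
        """Add every object in new_objects, or none of them."""
        staged = dict(self.objects)           # work on a copy, not the live state
        for object_id, record in new_objects.items():
            if object_id in staged:
                raise ValueError(f"{object_id} already exists")
            staged[object_id] = record
        check_consistency(staged)             # raises if the result is inconsistent
        self.objects = staged                 # the single commit point


def targets_exist(objects):
    """Hypothetical consistency rule: every dependency target must be present."""
    for object_id, record in objects.items():
        target = record.get("target")
        if target is not None and target not in objects:
            raise ValueError(f"{object_id} points at missing {target}")
```

If the check raises, the live `objects` dictionary has never been modified, so there is nothing to roll back.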

Let's look at the data we have in our repository. For example, doc_01.xml might look like this:

<?xml version="1.1"?>
<book xmlns="http://www.example.com/namespaces/book"
     xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
     xsi:schemaLocation="http://www.example.com/namespaces/book ../dtds/book/book.xsd"
>
  ...
</book>

Anyone notice the problem?

The problem is the value of the xsi:schemaLocation= attribute: the schema location it gives is a relative URI that reflects the location of the schema on the filesystem from which the documents were imported. But we're not in that domain any more. We've crossed the pass through the mountains and we're into a different country with different language and customs. That URI may or may not be resolvable in terms of the location of the data within the repository.

If you're using a system like Subversion, where the documents are never processed directly from the repository but are always exported first to create working copies, and those working copies will reflect the original relative locations, then that's OK, because the repository is really just a holding area.

But what you really want is the ability to process the documents in the repository directly from the repository (e.g., as though the repository were itself a file system of some sort). You want this because it's expensive and inefficient to have to do an export every time you want to process a document because, for most documents in the domain of technical documentation, there will be a lot of files involved, some of them potentially quite large (e.g., graphics). It would be much easier if you could just access the data directly, e.g., via an HTTP GET without the need to first make a copy of everything.

But in order to do that all the pointers have to be rewritten to reflect the new locations of everything in the repository as stored.

This is non-trivial but it's not that hard either. You just need to know what the repository-specific method of referring to objects within the repository is and what the mapping is between the objects as imported (that is, in their original locations) and the objects as stored. The repository-specific pointers could take many different forms: object IDs, HTTP URLs, repository-specific URIs, or whatever. In today's world it generally makes most sense for the repository to use URLs so that you can use standard and ubiquitous HTTP services to access your repository contents.

For example, the XIRUSS-T system defines a simple HTTP convention whereby you can refer to a version either by naming its resource by resource object ID and, optionally, naming a resolution policy (the default is "latest visible version") or by version object ID. The XIRUSS-T system also defines some basic organizational structures that can also be used to construct unambiguous and persistent URLs and you can define arbitrary organizational containers (analogous to directories) by which you can also address objects. So in XIRUSS there are two base addressing methods (resource ID + resolution policy and version ID) that will always work and can be constructed knowing only the resource ID or version ID and other "convenience" forms that will also work.

So for our example, let us assume that book.xsd results in resource object RES0002 and version object VER0002. We can rewrite the xsi:schemaLocation= value in doc_01.xml like so:

<?xml version="1.1"?>
<book xmlns="http://www.example.com/namespaces/book"
     xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
     xsi:schemaLocation="http://www.example.com/namespaces/book /repository/resource/RES0002"
>
  ...
</book>

This is still a relative URL (relative to the server that holds the repository) but it is now addressing book.xsd as a resource/policy pair that can be reliably resolved to the appropriate version at any moment in time.
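The rewrite itself can be sketched in a few lines of Python using the standard library's ElementTree. This is an illustration under assumptions, not the XIRUSS-T importer: the function name and the `location_map` argument (old pointer to repository address) are made up for the example. Two limitations are worth knowing. Python's bundled parser (expat) handles XML 1.0 only, so a document declared as version 1.1 would need a different parser. ElementTree also regenerates namespace prefixes when it serializes, which is exactly the kind of thing an importer may fail to preserve.

```python
import xml.etree.ElementTree as ET

XSI = "http://www.w3.org/2001/XMLSchema-instance"
SCHEMA_LOCATION = f"{{{XSI}}}schemaLocation"


def rewrite_schema_location(xml_text: str, location_map: dict) -> str:
    """Replace every schema location found in location_map; keep the rest."""
    root = ET.fromstring(xml_text)
    value = root.get(SCHEMA_LOCATION)
    if value is not None:
        # The value is a whitespace-separated token list. Only tokens that
        # appear in location_map are pointers we know how to re-address.
        tokens = [location_map.get(token, token) for token in value.split()]
        root.set(SCHEMA_LOCATION, " ".join(tokens))
    return ET.tostring(root, encoding="unicode")
```

Called with a map such as `{"../dtds/book/book.xsd": "/repository/resource/RES0002"}`, the namespace name is left alone and only the location token changes.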

This need to rewrite pointers is universal if you want to be able to process storage objects as stored and you don't want to limit yourself to the static and limiting organizational facilities of a typical file system (which you don't, trust me).

Therefore, you need an import framework or mechanism that can do two things:

- For a given storage object to be imported, determine what its address will be within the repository after import. This could either be by asking the repository (e.g., resource = repository.createResource(); resource.getId()) or by applying some established convention or using metadata within the data to be imported (for example, you might have already assigned globally unique identifiers to your documents, captured as attributes on the root element, and you use those identifiers as your within-repository object IDs).

- For each storage object, whatever its format, rewrite the pointers to reflect the new locations. It should go without saying that this process shouldn't break anything else. However, this is sometimes easier said than done. For example, the built-in XIRUSS-T XML importer imposes some limitations on what XML constructs it can and can't preserve during import, mostly for practical reasons.

This suggests that repositories should, as a matter of practice, provide some sort of import framework that makes it as easy as it can be (which isn't always that easy) to implement these operations. Any repository that provides only built-in importers or that does not make creating new importers particularly easy should get a very close look, because it's likely that any built-in importer won't do exactly what you want done or everything you need done (even if what it does do it does just how you want). If, for example, the import API is poorly documented or incomplete or, for example, it doesn't provide any way to get, set, or predict a resource's ID in advance of committing it to the repository, you've got a problem.

This is an area that a lot of enterprises don't check when evaluating potential XML-aware content management systems but it is a crucial area to evaluate because it is where you will be investing most of your integration and customization effort. The last thing you want to have to do is call Innodata Isogen to help you figure out how to get your stuff into and out of the tool you've already bought. Not that we're not happy to help but we'd rather not see you be in that position at all. We'd rather you hired us to quickly implement the exact functionality you need, cleanly and efficiently, rather than bang our heads against some product that resists all our efforts to bend it to our will. We like to have fun in our jobs too.

So our initial import process wasn't quite complete. We need to insert step 3.1 to include the pointer rewrite:

3.1 In temporary storage (or in the process of streaming the input bytes into the newly-created version objects) rewrite all pointers to reflect the locations of the target resources or versions as they will be within the repository.

In XIRUSS-T's import framework I have generic XML handling code that supports this rewrite activity and essentially acts as a filter between the input (from the file system) and the output (the new version objects) to do the rewriting. This generic XML handling code can then be used by schema-specific code that understands specific linking conventions. For example, there is an XInclude importer that recognizes xi:include elements and knows that it is the href= attribute that holds the pointer to be rewritten, an XSD schema importer that knows about schemaLocation, import, and include, and an XSLT importer that understands XSLT's import and include elements. You get the idea.

Notice here the separation of concerns, separating the generic operation of essentially changing attribute values in XML documents from the concern of schema-specific semantics. It's just basic object-oriented layering and abstraction, but it's really important and done correctly makes building importers so much easier.
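One way to picture that layering, as a sketch rather than the actual XIRUSS-T design: the schema-specific layer is just data saying which attribute on which element holds a pointer, and the generic layer is one function that applies a re-addressing callback to those attributes. The rule table and function name below are assumptions. The namespace URIs are the standard ones for XInclude, XSD, and XSLT.

```python
import xml.etree.ElementTree as ET

# Schema-specific knowledge: (element type, attribute that holds the pointer).
POINTER_RULES = {
    "xinclude": [("{http://www.w3.org/2001/XInclude}include", "href")],
    "xsd": [
        ("{http://www.w3.org/2001/XMLSchema}import", "schemaLocation"),
        ("{http://www.w3.org/2001/XMLSchema}include", "schemaLocation"),
    ],
    "xslt": [
        ("{http://www.w3.org/1999/XSL/Transform}import", "href"),
        ("{http://www.w3.org/1999/XSL/Transform}include", "href"),
    ],
}


def rewrite_pointers(root, rules, new_address):
    """Generic layer: pass each pointer named by rules through new_address.

    Returns the number of attributes rewritten.
    """
    count = 0
    for tag, attribute in rules:
        for element in root.iter(tag):
            old = element.get(attribute)
            if old is not None:
                element.set(attribute, new_address(old))
                count += 1
    return count
```

Supporting a private document type then means adding one more entry to the table (or one more small handler), not writing another XML rewriter.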

OK, so now our repository data is completely consistent. I can access doc_01.xml directly from the repository (e.g., using an HTTP GET request to access the byte stream stored by the storage object VER0001 [the first version of resource RES0001, which is the resource representing all the versions in time of doc_01.xml]).

The structure of our repository looks like this:


/repository/resources/RES0001  - "doc_01.xml"; initial version: VER0001
                     /RES0002  - "book.xsd"; initial version: VER0002
           /versions/VER0001   - "doc_01.xml"; dependency: DEP0001
                    /VER0002   - "book.xsd"
           /dependencies/DEP0001 - Target: RES0002; policy: "latest"

We can now interrogate the repository and figure some things out. We can ask for the latest version of resource RES0001 and we'll get back version VER0001. We can ask for the list of all dependencies from VER0001 and we get back a list of one dependency, DEP0001. DEP0001 points to resource RES0002 with policy "latest" which resolves, at this point in time, to version VER0002.

Assuming we have an HTTP server on the front of our repository that lets us access all these objects via HTTP URL, we can do something like this:


$/home/ekimber> validate http://repositoryhost/repository/versions/VER0001

The validation application, using normal HTTP processing, will open VER0001 as a stream (as it would for any HTTP resource), read its bytes, see the xsi:schemaLocation= value, resolve that URL normally, get those bytes, process them as an XSD schema (which they are) and validate the document. It's just that easy.

You can do this today with the XIRUSS-T system. For example, with XIRUSS you can have a document, its governing schema, and an XSLT all in the repository and all accessed and used directly from the repository via normal HTTP processing using most XSLT engines without modification. It just works.

However, we're not quite done yet. While the repository holds all our data and is internally consistent we haven't captured any metadata other than the original filenames of the files as imported (which is sort of a given but not necessary or even always desired--I've done it here mostly to make the repository structure clear).

So we need to add step 3.2 to extract and set the appropriate metadata. For this example, that will include:

- For book.xsd, the namespace it governs

- For each XML document, its XML version, all of the namespaces that it uses and its root element type.

- For each text file (that is, a file whose MIME type is some flavor of "text"), the character encoding used. All XML documents are also text files.

- For all storage objects, their MIME type.

- For each dependency, the dependency type (e.g. "governed by").

3.2 For each BOS member, identify the relevant metadata items and create each one as a metadata item on the appropriate newly-created repository object.
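For XML storage objects, most of the metadata listed above can be pulled out with a single streaming parse plus a peek at the XML declaration. The following Python sketch is an assumption-laden illustration: the function names and metadata keys are mine, "namespaces used" is approximated by "namespaces declared", and because Python's bundled parser (expat) handles XML 1.0 only, a document declared as version 1.1 would need a different parser.

```python
import io
import re
import xml.etree.ElementTree as ET

XML_DECL = re.compile(
    r'^<\?xml\s+version\s*=\s*["\']([^"\']+)["\']'
    r'(?:\s+encoding\s*=\s*["\']([^"\']+)["\'])?'
)


def sniff_declaration(data: bytes):
    """Return (xml version, encoding), with defaults when nothing is declared."""
    if data[:2] in (b"\xff\xfe", b"\xfe\xff"):       # UTF-16 byte order mark
        head, default = data[:400].decode("utf-16", errors="ignore"), "UTF-16"
    else:
        head, default = data[:200].decode("ascii", errors="ignore"), "UTF-8"
    match = XML_DECL.match(head)
    if match is None:
        return "1.0", default
    return match.group(1), match.group(2) or default


def extract_metadata(data: bytes) -> dict:
    namespaces = set()
    root_tag = None
    events = ET.iterparse(io.BytesIO(data), events=("start-ns", "start"))
    for event, payload in events:
        if event == "start-ns":
            namespaces.add(payload[1])               # payload is (prefix, uri)
        elif root_tag is None:
            root_tag = payload.tag                   # first start event is the root
    version, encoding = sniff_declaration(data)
    return {
        "mime type": "application/xml",
        "root element type": root_tag.rsplit("}", 1)[-1],
        "namespaces": sorted(namespaces),
        "xml version": version,
        "encoding": encoding,
    }
```

Schema-specific items, such as the target namespace of an XSD, would be added by a schema-aware layer on top of this generic one.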

The structure of our repository now looks like this:


/repository/resources/RES0001  - name: "doc_01.xml"; initial version: VER0001
/repository/resources/RES0002  - name: "book.xsd"; initial version: VER0002
           /versions/VER0001   - name: "doc_01.xml"; Resource: RES0001
                                 dependency: DEP0001
                                 namespaces: http://www.example.com/namespaces/book
                                 root element type: "book"
                                 mime type: application/xml
                                 xml version: 1.1
                                 encoding: UTF-8
                    /VER0002   - name: "book.xsd"; Resource: RES0002
                                 root element type: "schema"
                                 namespaces: http://www.w3.org/2001/XMLSchema
                                 target namespace: http://www.example.com/namespaces/book
                                 mime type: application/xml
                                 xml version: 1.0
                                 encoding: UTF-16
           /dependencies/DEP0001 - Target: RES0002; policy: "latest"
                                   Dependency type: "governed by"

Now we're getting somewhere. We can do a lot more interesting things with this information. For example, we can ask the question "what schema governs namespace 'http://www.example.com/namespaces/book'?". Or "what documents are governed by the schema for namespace 'http://www.example.com/namespaces/book'?" Or "what documents have the root element 'book'?" Or "give me all the XML documents". Or "give me all the XML documents that are not XSD schemas".

You get the idea.

Even if you have little implementation experience it should be fairly obvious that a basic implementation of these queries would be quite easy to build--you just look at all the objects, examine their metadata values and match them to the query terms. Of course for anything real you'd probably use a proper database to index and optimize access to the metadata and there's no reason a normal SQL database wouldn't work perfectly well for that (even, perhaps, Gadfly). But the brute force solution is pretty simple yet yields amazing power.
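The brute-force approach really is only a few lines. Here is a hedged Python sketch over an in-memory copy of the metadata shown earlier. The `query` function, the dictionary layout, and the convention that a criterion may be either a literal value or a predicate are assumptions made for the example.

```python
BOOK_NS = "http://www.example.com/namespaces/book"

VERSIONS = {
    "VER0001": {
        "mime type": "application/xml",
        "root element type": "book",
        "namespaces": [BOOK_NS],
    },
    "VER0002": {
        "mime type": "application/xml",
        "root element type": "schema",
        "namespaces": ["http://www.w3.org/2001/XMLSchema"],
        "target namespace": BOOK_NS,
    },
}


def query(versions, criteria):
    """Return the IDs of versions whose metadata satisfies every criterion."""
    def matches(metadata):
        for key, wanted in criteria.items():
            actual = metadata.get(key)
            if callable(wanted):                      # predicate criterion
                if not wanted(actual):
                    return False
            elif isinstance(actual, (list, tuple, set)):
                if wanted not in actual:              # multi-valued metadata
                    return False
            elif actual != wanted:
                return False
        return True

    return sorted(vid for vid, metadata in versions.items() if matches(metadata))


# "What schema governs the book namespace?"
governing = query(VERSIONS, {"target namespace": BOOK_NS})
# "Give me all the XML documents that are not XSD schemas."
not_schemas = query(
    VERSIONS,
    {"mime type": "application/xml", "root element type": lambda r: r != "schema"},
)
```

Every query scans every object, which is fine for a sketch and is the part a real index or SQL database would replace.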

The import process I've outlined here is pretty much the minimum you need to do for XML to get consistent and correct data and useful metadata. The XIRUSS-T system provides a sample implementation of exactly this process with support for a variety of standard XML applications (XSD, XSLT, DITA (albeit old IBM DITA), and XInclude).

If you are thinking about these operations in terms of your own XML, which if you're using XML for technical documentation today is probably pretty sophisticated, you are probably realizing that there's a lot more that you either need to do or could do in terms of capturing important dependencies and storage-object metadata.

Note too that we haven't said anything about Layer 2 metadata, that is, metadata that applies directly to elements. The closest we've come is to capture the root tag name of each XML document, which is just a reflection of the fact that it's a common query that's easy to support at this level so there's no reason not to capture it. [It helps particularly in supporting the common use case of doing document-level use-by-reference where you point to an entire document in order to use its root element by reference. In that case, if you have captured the root element type and governing schema you can implement reference constraints without having to look inside the document. That's a significant savings for a very common and reasonable re-use constraint {despite the fact that link-based use-by-reference enables using elements that are not document roots, limiting yourself to document roots makes a lot of things simpler, especially in terms of author support user interfaces--it's much easier to present a list of files or documents than to present a list of elements drawn from who knows where. So if you can live with the constraint it's not a bad one to at least start with.}].

Because, with the repository as shown, we can both distinguish XML from non-XML storage objects and resolve the XML-level storage-object-to-storage-object relationships using normal XML processing tools, we can start by implementing all our Layer 2 and Layer 3 functionality in a completely ad-hoc way using brute force or, for example, by maintaining the necessary optimization indexes and databases totally separate from the repository. With just what I've shown here it should be obvious that there's enough information and functionality to, for example, write a simple process that gets each XML document and looks inside it in order to do full-text indexing or link indexing or whatever. This requires nothing more than the ability to send HTTP requests to the repository server (let us assume that you can use URLs to get specific metadata values or sets of metadata values).

From this it should be clear that any system that starts by indexing all the XML just as a cost of entry (meaning cost of initial implementation as well as cost of use) is so obviously doing premature optimization that it's not even funny.

To sum up:

- The boundary between the outside world and the repository is the Rubicon that you have to cross to get your data into the repository.

- For documentation formats, XML or otherwise, doing import requires the following basic operations:

1. Determining the bounded set of storage objects that needs to be imported so that the result in the repository is complete and correct.

2. Rewriting any pointers in the imported data so that the imported result points to the targets as imported.

3. Extracting or gathering any storage object metadata and binding that metadata to the newly-created repository objects.

4. Instantiating the repository objects, including resources, versions, and dependencies, with their attendant metadata and, for storage objects, data content (as a sequence of bytes).

- For a given bounded object set to be imported, the import operation should be a single atomic transaction that can be rolled back (undone) as a single action. This ensures that the repository is always in a consistent state, even if the import processing fails midway.

- Some of the import processing can be generic (rewriting XML attribute values) but most of it will be schema specific (understanding how XSD schemas are related to other documents, understanding the linking syntax and semantics in your private document type). In a layered system you can build up from general to specific, taking advantage of relevant standards, to make schema-specific import processing easier to create and maintain.

- The existence of standards like XSD, XInclude and DITA makes it possible to build in quite a bit of very useful generic import functionality for those standards, which, even if that's all you have, gives you a pretty good starting point.

- You can still get a lot of mileage out of just Layer 1 metadata, as demonstrated by the scenario walked through in this post. We haven't done anything to capture Layer 2 metadata yet we can already answer important questions about our documents as XML just through the simple metadata values we've captured.

- Note too that the repository itself, that is the Layer 1 structures we've seen so far, knows absolutely nothing about XML itself. The metadata mechanism is completely generic name/value pairs where you, the importer, specify both the name and the value. This is why something like Subversion is an excellent candidate to build your Layer 1 system on.

That is, all the XML awareness is in the importer and in the queries applied against the repository, not in the repository itself. That's one reason I chafe at the term "XML repository"--it strongly suggests over-engineering and poor separation of concerns from the get-go.

- The correlation of files to be imported with existing resources and versions in the repository cannot be done automatically in the general case. You must define or provide one or more ways to do it, either automatically using some convention (CVS) or interactively through human intervention. A completely generic repository should leave the choice up to the implementor and integrator. XIRUSS-T does this through its generic system-object-to-version map.
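
The four import operations and the atomicity requirement summarized above can be sketched as a toy in-memory repository. Everything here is hypothetical: the `Repository` class, the `obj1`-style object IDs, the metadata names, and the simplification that every pointer is an `href="..."` attribute naming another file in the same import set (no relative-path resolution, no links to the outside world). It is not XIRUSS-T's API.

```python
import re

HREF = re.compile(rb'href="([^"]*)"')


class Repository:
    def __init__(self):
        self.objects = {}      # object ID -> {"data": bytes, "metadata": dict}
        self.next_number = 1

    def import_set(self, files):
        """Import a bounded set of files (path -> bytes) as one atomic action."""
        # Step 1 is the caller's job: `files` is the bounded object set.
        # Reserve IDs up front so pointers can be rewritten to them.
        number = self.next_number
        ids = {}
        for path in files:
            ids[path] = f"obj{number}"
            number += 1

        def retarget(match):
            target = match.group(1).decode("utf-8")
            if target not in ids:
                raise ValueError(f"pointer to {target!r} is outside the import set")
            return b'href="' + ids[target].encode("utf-8") + b'"'

        staged = {}
        for path, data in files.items():
            is_xml = path.endswith(".xml")
            if is_xml:
                data = HREF.sub(retarget, data)           # step 2: rewrite pointers
            meta = {"source.path": path,                  # step 3: gather metadata
                    "is.xml": str(is_xml).lower()}
            staged[ids[path]] = {"data": data, "metadata": meta}

        # Step 4: instantiate. Nothing touched the repository before this
        # point, so a failure above leaves it exactly as it was.
        self.objects.update(staged)
        self.next_number = number
        return ids


repo = Repository()
ids = repo.import_set({
    "doc_01.xml": b'<book><graphic href="figure_01.png"/></book>',
    "figure_01.png": b"\x89PNG",
})
```

Staging everything and committing in one final step is only one way to get rollback; a repository backed by a transactional store would lean on that store's transactions instead.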

There's lots more to discuss before we've even covered the basics of XML import but that will have to wait until next time.

Next time: importing the next version of doc_01.xml: all heck breaks loose

Labels: XCMTDMW "xml content management" import snapcm

5 Comments:

Anonymous said...: OK. A few questions.

The first one totally shows me for the noob I am to XML. Please remember I learned on my own under project duress less than 3 years ago. I just happened to realize it's amazing along the way.

1) I am missing a subtlety in the terminology of 'semantic' and 'syntactic'. I understand their definition (semantic-determining the meaning of, syntactic-following the rules of the language or structure). But I become confused when trying to separate the need for the differences. Does 'processing XML semantically' mean determining what it might be being used for, or how its related references apply to it? I am missing something subtle... can you help?

2) When you are re-writing relative links within an XML document during an import, why do you not create another version of it? And, when doing it, how do you retain the original relative reference if you check-out or export it?

3) This may be jumping ahead (or back): What structure tells "doc_01.xml" that it needs a different version (other than latest) of "book.xsd"? For example, if you're creating a document assembly referring to document and graphic objects of varying versions, is that version identification controlled by attributes in the "doc_01.xml" file or by the resource or the BOS? This is an area that confuses me a lot. On one hand, it seems like you don't want to explicitly state versions within your source XML, but instead by using some kind of wrapper code that functions as an 'assembly' document showing the list of related versioned assets at any given instance. Now I'm confusing myself further.

Thanks!; 10:09 AM
Eliot Kimber said...: These are all good questions. The last one is explicitly answered in the next post I'm writing.

On the issue of syntactic vs semantic, I understand your confusion--it's something I struggle with communicating clearly. It doesn't help that a lot of XML processors don't make a clear distinction either.

The essential difference is, in XML terms, the difference between what the parser does and what happens after parsing.

If you look, for example, at an XSLT process applied to an XML document, there are two stages:

Stage 1. Parse the bytes in the input file to distinguish markup from content and produce some sort of in-memory representation of the resulting document tree, i.e., a DOM. This is syntactic processing and is 100% determined by the rules of the XML specification.

Stage 2. Using that DOM (or series of SAX events or whatever), apply the semantics of XSLT to the document to produce some output.

This is semantic processing applied to the abstraction that the XML represented. The processing is entirely determined by the application, in this case the XSLT specification and the writer of the XSLT script, and is not, for most purposes, constrained by the syntax of XML. That is, the XSLT engine doesn't care whether or not you used external parsed entities or which namespace prefix you chose or how you put whitespace inside your markup: those are all syntax things that have no bearing on the true meaning expressed by the markup.

One key difference is the degree of choice you have about what to do: with syntactic processing there is usually only one right thing to do: either a left angle bracket starts a tag or it doesn't. With semantic processing there is an infinity of choice: given an element (not a string, but an element object in memory), you can do anything--your only bounds are those you impose on yourself, for example, by agreeing to implement a standard set of semantics or by cleaving to a corporate business rule.
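
A small illustration of the two stages, using Python's standard `xml.etree.ElementTree` rather than an XSLT engine. The two sample documents, the `urn:example:book` namespace, and the title-length rule are all made up; the rule simply stands in for "whatever semantics your application chooses to apply".

```python
import xml.etree.ElementTree as ET

# Same abstract document, different syntax: prefix, quoting, whitespace in tags.
doc_a = '<b:book xmlns:b="urn:example:book"><b:title>Import</b:title></b:book>'
doc_b = "<x:book  xmlns:x='urn:example:book' ><x:title >Import</x:title></x:book >"


def shape(element):
    """Reduce an element tree to nested tuples for comparison."""
    return (element.tag, element.text, [shape(child) for child in element])


# Stage 1 (syntactic): the parser's job, fully determined by the XML spec.
tree_a = ET.fromstring(doc_a)
tree_b = ET.fromstring(doc_b)
same_tree = shape(tree_a) == shape(tree_b)


# Stage 2 (semantic): an arbitrary application rule applied to the parsed tree.
def title_length_ok(root, limit=20):
    title = root.find("{urn:example:book}title")
    return title is not None and len(title.text or "") <= limit


rule_a = title_length_ok(tree_a)
rule_b = title_length_ok(tree_b)
```

Once parsing is done, the rule cannot tell the two inputs apart, which is the sense in which the syntactic choices have no bearing on the meaning.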

A quintessential example is the difference between DOCTYPEs and schemas.

A DOCTYPE, represented by a DOCTYPE declaration, is a syntactic component of an XML document. You can choose to process it or not but if you do process it you are doing it as part of the initial parse. In addition, because it is a syntactic part of the document it can be completely contained within the document instance and the validation result will be the same. In the absence of any local, non-standard-defined rules imposed by tools or human enforcers, an XML document's author has complete freedom to create or change the DOCTYPE declaration as they please.

By contrast, an XML schema is not a syntactic part of the document. It is a physically and syntactically distinct document entity. It is associated with a document through an explicit or implicit link. The validation of a document against a schema is not a syntactic operation because it's not something a parser can do (because it's not an operation defined by the XML specification). Rather, it's a semantic process applied after the initial parse. The confusion can come when these two phases are packaged together in a single step called a "parser" (e.g., the Xerces parser). But the fact that you can use any underlying XML parser and still get schema validation should be a clue.

Note too that because they are physically and syntactically distinct, authors of documents cannot unilaterally change the schema itself. They can change the schemaLocation= attribute (if there is one) but they can't change the schema just because they can change their document.

For your question 2, I'm not sure I understand what you're asking but I suspect it's answered in the post I'm about to make.; 10:38 AM
Anonymous said...: That makes a whole lot more sense. The XSLT example gave me a good framework to work from. As you stated, I used the term "parser" in my head to mean syntactic validator as well as semantic applicator. Thanks...; 9:49 AM
John Cowan said...: A couple of miscellaneous points in no particular order:

1) Using jing and trang, you can treat an external DTD subset as a nonsyntactic (I'd rather say non-embedded) validation spec, just like an XML Schema or other modern schema type. This is often useful when you are still dealing with DTDs for one reason or another -- don't include them with DOCTYPE declarations but maintain them externally.

2) If you are validating all your documents, there's no reason not to validate your schemas too. They are XML documents and are described by an XML Schema: the Schema for Schemas.

3) What your ruminations about possible multiple XML Schemas for a namespace show is that using the namespace name as the identifier for an XML Schema is a bad idea. Schemas should have their own identifiers. For RELAX NG, of course, this is a must.

4) Another fundamental difference between Subversion and CVS is that a version number in Subversion represents a version of the repository, not of a particular file. Every time you commit a change, the repository version number increases, and every time you update a working copy, all the files in that working copy are upgraded to the latest version of the repository. This being so, you may want to think twice about making version identifiers fundamental in your model.

It's quite common for a file managed by svn to be unchanged in versions 1-10 of the repository, then change in version 11 and persist till version 32, then change in versions 33, 34, and 35, then ...
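
As a rough model of that pattern (this is not Subversion's API, just arithmetic on the revision numbers from the example above), you can record the repository revisions in which a file actually changed and ask which of them supplies the file's content at any given repository revision:

```python
from bisect import bisect_right

# Repository revisions in which one particular file changed (from the example).
changed_in = [1, 11, 33, 34, 35]


def content_as_of(changed_in, repository_revision):
    """Return the revision whose content the file has at repository_revision.

    Returns None if the file did not exist yet at that revision.
    """
    position = bisect_right(changed_in, repository_revision)
    if position == 0:
        return None
    return changed_in[position - 1]


at_10 = content_as_of(changed_in, 10)
at_32 = content_as_of(changed_in, 32)
at_35 = content_as_of(changed_in, 35)
```

Many repository revisions map to the same file content, so a repository revision number on its own says nothing about whether a particular file differs between two revisions.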

5) Disk being cheap these days, it's not necessarily a Bad Thing to keep a fully checked out, read-only copy of your repository around for processing purposes. Dynamically fetching files from the repo isn't necessarily faster, cleaner, or better. Using WebDAV makes it look like your svn repository is just a filesystem, but it's far from clear that that's better than a read-only copy made with svn export.

6) I think it would be better to talk about making URI paths repository-relative rather than repository-specific when importing. In that way, they will work if the tree is checked out or exported.; 3:21 PM
Eliot Kimber said...: In response to John's comments:

1. Using jing and trang, you can treat an external DTD subset as a nonsyntactic.

Note that I'm not ragging on DTD syntax vs schema syntax, only on syntactic (DOCTYPE) vs semantic (schemaLocation or namespace matching).

2. If you are validating all your documents, there's no reason not to validate your schemas too.

Good point. I guess I sort of see the validation process as validating the schemas as a side effect but it's perfectly reasonable to validate everything. Or nothing.

3. Schemas should have their own identifiers.

Yes--there's a missing thing in the XML universe, which is a standard way to name XML applications as unique objects, distinct from their various implementing artifacts, including XSD schemas. We tend to pretend that namespaces are application names but they're not. I think this is on my list of subjects to blog about at some point.

4) I just moved the XIRUSS-T source code from CVS to Subversion on SourceForge, so I'm starting to understand SVN's way of naming versions.

But note that my version IDs are not version numbers in the sense that either CVS or Subversion mean--they're just object IDs of objects that happen to be Versions in the SnapCM model. You could use any form of display or semantic version naming or numbering you wanted, including CVS-style per-resource version numbers or Subversion repository state change counters or something that's a closer match to your business process workflows (for example, reflecting draft versions).

I haven't gotten to it yet but SnapCM enables arbitrarily sophisticated ways to organize sets of resources and their versions.

5) I agree that for work patterns where you create a working copy and then leave it mostly invariant for a long period of time, copying the whole thing usually makes sense (except when your old laptop with the 40 gig drive only has 2 gig left and some chump has committed 2.5 gig of graphics into the repository....)

I'm thinking more of the case where you want to do processing on a small subset of the repository or where the copies would not be persistent, such as when you want to render one publication out of thousands in the repository. In that case you probably need to be more clever about what you copy and what you don't. For example, you might keep a cached copy of the graphics local to the processor but get the XML source dynamically. But it's good to have the choice.

6) I think it would be better to talk about making URI paths repository-relative rather than repository-specific when importing. In that way, they will work if the tree is checked out or exported.

I think you'll see why this is an unnecessary restriction when you get to Part 4 of import is everything. The repository itself has no necessary structure so, in the general case, it's not meaningful to have "repository relative" URIs. In SnapCM, any organization constructs are purely arbitrary and user created. In XIRUSS I have a version type called "organizer" that is a generalization of directories and with that you can construct any sort of containing hierarchy you want and you can resolve URLs against that hierarchy. But those structures are not necessarily persistent so the only reliable address is one that refers directly to resources or versions by their object IDs.; 3:42 PM
