Subscribe to Dr. Macro's XML Rants

NOTE TO TOOL OWNERS: In this blog I will occasionally make statements about products that you will take exception to. My intent is to always be factual and accurate. If I have made a statement that you consider to be incorrect or inaccurate, please bring it to my attention and, once I have verified my error, I will post the appropriate correction.

And before you get too exercised, please read the post, dated 9 Feb 2006, titled "All Tools Suck".

Saturday, July 01, 2006

Namespaces, Tables, and Schemas, Oh My

We're finally settled into our new house and into a more or less consistent daily routine which means I should now be able to attend to my rants a bit more. The dogs are doing their part by waking me up around 5:30 a.m. whether I want to or not. That gives me a few quiet hours in the morning that I should spend doing productive things like this blog. Thanks dogs. [And serious props to Tim Bray, who is a father for the second time yet still finds time to post to his blog. And of course props to Tim's wife Lauren for doing all the hard work. It makes me feel just a little whiny for even alluding to something as trivial as being awakened by my dogs--at least my daughter has a regular (and long) sleep schedule.]

Anyway, the rant of the day is not so much a rant as a conundrum: what's the best way to enable recognition of standard XML types that are intended to contain arbitrary stuff from non-standard namespaces such that the schemas governing the non-standard stuff can constrain the rules for what goes inside the standard stuff?

This question comes from my attempts to integrate standard table models, in particular the OASIS (nee CALS) table model, into purpose-built document types that are, per my rule that all document types should be in a namespace, in their own namespace and whose constraints are formally defined using XSD schemas.

For example, the OASIS Exchange Table Model (which supplants the hoary CALS table model used in most technical documentation doc types) does not define a namespace (the specification was published in 1999, before namespaces were even finalized or in common use).

This means that one can, like DocBook, simply add the table element types to your schema and go on your way. But that is asserting that those element types are part of your schema, not a standard module that you are using by reference. In the bad old DTD-only days this was the only thing you could do because there was no concept of namespaces and no good way to distinguish your names from somebody else's names (except through the use of SGML Architectures as defined in ISO/IEC 10744, but very few people stepped up to that level of sophistication). The fact that "standard" DTDs like OASIS tables and DocBook had you customize them by modifying parameter entities to add your own types or otherwise modify the syntactic definition of the structures should be strong evidence that these were not reusable objects in any useful sense but merely templates that one could use with some hope of getting consistent behavior from tools.

But with schemas you have the ability to create truly modular schema components that can be used by reference. This is because, unlike DTDs, schemas are not syntactic components of the documents they govern but separate objects. While schemas define syntactic constraints on documents they are themselves not part of the governed documents' syntax. That is, because DTDs are part of a document's syntax, they are always processed by the XML parser. By contrast, schemas are not part of the document syntax and are processed semantically following the initial XML parse (the rules they impose may be validated by the parser but that will only be after doing the initial parsing that is defined purely by the XML spec itself).

This is an important aspect of XSD schemas [and any similar constraint specification mechanism--I focus on XSD schemas because they are the W3C standard for XML constraint specifications and the most widely supported of the non-DTD constraint mechanisms]. Because they are processed semantically schemas can, and do, provide mechanisms for having truly modular schemas. This makes it possible to combine element types from different namespaces into a single document type without having to literally copy the declarations or, necessarily, bring those declarations into your namespace.

The use of namespaces also addresses another inherent problem in the old, DTD-based, non-namespaced, CALS and DocBook way of doing things: how do you know, unambiguously, that a given element is in fact a CALS table or a DocBook document? You can't. In the absence of some application-specific identifier there is nothing that unambiguously identifies a given element as being part of a CALS table (or a DocBook document). You have some strong hints, like the name "tgroup" for a container of rows in CALS (and OASIS) tables, but that's not 100% reliable. The external identifier of the DTD isn't reliable because it's purely arbitrary and, even when it's a public ID or fully-qualified URL, it doesn't guarantee anything about what's declared in the file at the other end. Consider this perfectly valid XML document:


<?xml version="1.0"?>
<!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook V4.1//EN" "http://www.example.com/mydoctypes/foo.dtd">
<book><not-a-docbook-book/></book>


Where foo.dtd is:


<!ELEMENT book (not-a-docbook-book)>
<!ELEMENT not-a-docbook-book EMPTY >


In this case the use of the DocBook public ID is, one might argue, a lie, but the result is perfectly valid (it doesn't help that the precedence of PUBLIC and SYSTEM IDs is not universally defined, so depending on how you do entity resolution, you might actually get the DocBook 4.1 DTD or you might get foo.dtd).
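
To make the precedence point concrete, here is a minimal sketch using an OASIS XML Catalog, which is one common way entity resolution gets configured. The catalog file itself and the local path docbook/docbookx.dtd are assumptions for illustration only. With prefer="public", a catalog-aware resolver maps the public ID to the local DocBook copy and never looks at foo.dtd; with prefer="system" (or with a resolver that ignores catalogs altogether), the system ID wins and foo.dtd is what gets parsed.

```xml
<?xml version="1.0"?>
<!-- Hypothetical catalog.xml; the uri value is an assumed local path. -->
<catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog" prefer="public">
  <!-- Used when no system entry matches, even though the document also supplies a system ID. -->
  <public publicId="-//OASIS//DTD DocBook V4.1//EN"
          uri="docbook/docbookx.dtd"/>
</catalog>
```

Same document, same DOCTYPE line, two different sets of declarations depending on configuration that lives outside the document.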

Hopefully you get the idea (if you're not getting the idea, then you have deep SGML brain damage and we need to talk).

Of course in practice people made this fundamentally broken system (that is, the DTD mechanism for defining document constraints, not its use by CALS or DocBook) work by respecting the conventions that were established (such as that "tgroup" means a CALS table so don't use it for something else) and, for the most part, people didn't do the sort of thing I just did above (except when they did, often quite innocently--ask yourself this question: can you prove that any two documents that claim to be "DocBook" documents are in fact interoperable or are reliably processible by a generic DocBook processor? The answer is no, you cannot--the reason should be obvious but clearly it is not to some people. I'll have to save that discussion for another rant.)

So in the context of a particular processing system built to handle a specific set of documents with a known document type created by the same people who built the processing system, there's no practical problem: everyone knows what they're doing and supposed to do and they just make sure the right stuff happens. Good enough.

The real problem is faced by generic systems, systems that need to be able to reliably and correctly handle documents they've never seen before purely in terms of the standards those documents claim to conform to.

A good example is the generic XML editor, such as Arbortext Editor or XMetal. OASIS tables are a published standard. Table editing is a useful and distinguishing feature of XML editors. Therefore they provide built-in table editing features. But how does Arbortext Editor, for example, know for sure that the element called "tgroup" in your document is in fact an OASIS (or CALS) table? Without more information it can't.

In the case of Arbortext Editor they actually use a heuristic to make an educated guess: if the element type is named "tgroup" and it has attributes thus and such and it has subelements named this and that then it's probably a CALS/OASIS table. But that's pretty weak. I know about this because I had created a DTD that had an element called "tgroup" that I intended to be a CALS table but had heavily modified to suit the needs of the particular client, which wanted, if memory serves, to severely limit what authors could do with tables. When I opened one of these documents in the editor, no table editor, which normally just shows up. WTF? Turns out I had removed a key indicator of CALS tableness. Doh.

But wouldn't it be much better if there was a simple and completely unambiguous indicator that that "tgroup" element was in fact a conforming CALS/OASIS tgroup? Absolutely.

Do we have a mechanism for doing that? Yes we do: namespaces.

Fine you say, finally, you've gotten around to namespaces.

I'm trying to make it clear that unnamespaced elements cannot constitute reusable document type modules in any reliable way if you have any expectation of reliable recognition and processing by generic processors. If I haven't made that point clear by now, let me know and I'll try again.

So we should be clear that for something like a table module to work as a module that then enables reliable generic processing it must be unambiguously identified in some way.

How can that be done?

The obvious way is simply to put the table module into its own namespace. Another way is to do what we've done in DITA 1.1 and put a single attribute into a namespace.
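
As a rough illustration of the first option, a document instance might look like the sketch below. Both namespace URIs and the ug:* element names are made up for the example (OASIS has not defined a namespace for the exchange table model). The point is only that a generic processor can key on the namespace name rather than guessing from the local name "tgroup".

```xml
<?xml version="1.0"?>
<!-- Both namespace URIs and the ug:* element names are illustrative assumptions. -->
<ug:section xmlns:ug="http://example.com/namespaces/userguide"
            xmlns:tbl="http://example.com/namespaces/oasis/table_exchange_model">
  <ug:p>Torque values:</ug:p>
  <tbl:table>
    <tbl:tgroup cols="2">
      <tbl:tbody>
        <tbl:row>
          <!-- Cell content comes from the document's own namespace. -->
          <tbl:entry><ug:p>Bolt A</ug:p></tbl:entry>
          <tbl:entry><ug:p>12 Nm</ug:p></tbl:entry>
        </tbl:row>
      </tbl:tbody>
    </tbl:tgroup>
  </tbl:table>
</ug:section>
```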

I like namespaces (see my previous rant about how I was originally wrong about namespaces). Namespaces make things clear to everyone.

However the body of practice with using namespaces to create compound document types composed from multiple modules intended to be mixed and matched is quite thin and I don't think we've yet arrived at a consensus of what the best practice is. So I've been experimenting in the context of document types for technical documentation where you want to create a family of related document types that share some common structures, use appropriate standard components such as MathML, SVG, and OASIS tables, and are practical to author. In this context tables will always be essential and they will almost always be OASIS tables (at least until Arbortext and XMetal provide built-in support for graphically authoring XSL-FO tables).

My current working hypothesis is that each distinct set of re-usable element types should be in its own namespace. This follows in part from my assertion that (with a few small exceptions) every XSD document should govern a distinct namespace (and conversely, every namespace should be represented by exactly one XSD schema document in a given processing context).

For example, say you have two abstract document types that you know will share a lot of element types in common, for example, User Guides and Service Manuals. These are two different applications and therefore should be two different namespaces with two different top-level schemas. However the low-level stuff like paragraphs and figures and whatnot are going to be the same.

In that case I think you should have a third namespace that governs the common stuff. This makes the distinctions between the components involved clear and maintains the one-to-one XSD-to-namespace mapping. [Let's ignore authoring issues around namespace declaration for now: it turns out to not really be a problem in practice but I don't want to go into it now. For now you'll have to trust me or do your own experiments.]
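
A minimal sketch of what the top-level User Guide schema could look like under this arrangement. The namespace URIs, the file name common.xsd, and the element names are all assumptions for illustration; the sketch also assumes common.xsd declares global "title" and "p" elements in the common namespace. The Service Manual schema would have its own target namespace and import the same common schema.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical userguide.xsd: one schema document governing one namespace. -->
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           xmlns:common="http://example.com/namespaces/common"
           targetNamespace="http://example.com/namespaces/userguide"
           elementFormDefault="qualified">

  <!-- The shared low-level element types are used by reference, not copied. -->
  <xs:import namespace="http://example.com/namespaces/common"
             schemaLocation="common.xsd"/>

  <xs:element name="userguide">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="common:title"/>
        <xs:element ref="common:p" minOccurs="0" maxOccurs="unbounded"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>
```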

Given this approach it follows that the table module should also be in its own namespace. This reflects both the fact that it is in fact a separate module and also that it's defined by some other entity, i.e., OASIS, and that I don't own it or haven't copied it to make my own derivative thing.

However, having done that in an XSD schema, I immediately ran into this problem: how to define the content model of the entry element so that it only allowed those things that I want to allow within table cells?

There is no normative schema for OASIS exchange tables as far as I know (I couldn't find one navigating around on the OASIS site, and I only found the spec by googling for it--I didn't find any links to it on the OASIS site, which is where it lives: http://www.oasis-open.org/specs/tm9901.html). However, if you apply a DTD-to-schema converter you get something like this for the entry element:


<xs:element name="entry">
  <xs:complexType>
    <xs:complexContent>
      <xs:extension base="btd:tbl.entry.mdl">
        <xs:attributeGroup ref="btd:attlist-entry"/>
      </xs:extension>
    </xs:complexContent>
  </xs:complexType>
</xs:element>

<xs:complexType name="tbl.entry.mdl" mixed="true">
  <xs:group minOccurs="0" maxOccurs="unbounded" ref="btd:paracon"/>
</xs:complexType>


This reflects the use of DTD parameter entities to make the base declarations customizable. This of course won't work in XSD schema (because there's no analog of parameter entities). Note that all the ref= values are namespace qualified because, as is my practice, the containing schema governs a namespace.

The example above is the result of directly including the standard OASIS declarations in a DTD and then schemafying it. But of course I want the table schema to be in its own namespace, which will result in something like this:


<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
    targetNamespace="http://example.com/namespaces/oasis/table_exchange_model"
    xmlns:tbl="http://example.com/namespaces/oasis/table_exchange_model"
    elementFormDefault="qualified"
>
  <!-- ... -->
  <xs:element name="entry">
    <xs:complexType>
      <xs:choice>
        <xs:any processContents="lax" namespace="##other" maxOccurs="unbounded"/>
      </xs:choice>
    </xs:complexType>
  </xs:element>


Note that the entry element has to allow "any" as its content. The best XSD can do at this point is allow you to prevent elements in the table namespace from occurring in entry, but it doesn't help you constrain your own elements. At this point you could put any element from your private namespace into an entry. This is probably not what you want, certainly not in an authoring document type.

What to do?

One thought I had was to use substitution groups such that you could define, in your namespace, an element that could be substituted for the base entry element. However this won't work because the rules for substitution groups require that the types of the substituting elements be restrictions or extensions of the type of the substituted element. For example, this doesn't work (at least, Oxygen XML's schema validator reports it as invalid):


<xs:element name="mytablecell" substitutionGroup="btd:entry">
  <xs:complexType>
    <xs:group ref="btd:paracon" maxOccurs="unbounded"/>
  </xs:complexType>
</xs:element>


This is either a design bug in XSD or an unavoidable consequence of some essential aspect of the design (I haven't dug into the issue enough to know) although I think it's a design bug. At a minimum if the head element's content model is "any" and it repeats, you ought to be able to substitute any element type with any content model in that case (as long as the min and max occurs rules are consistent).

So Doh! This won't work. That doesn't seem to leave many good options (and even if it did work it still doesn't prevent an author from using the base entry element instead of the "mytablecell" element--that is, substitution groups effectively extend the contexts in which the head elements occur, they don't replace them).

Have I missed something? I don't think I have.

So what does that leave?

Either we can do what we did before namespaces and just copy the table element types into our own namespace and modify them however we want or we can use the namespaced attribute approach. The copying approach will work just as well as it did before but doesn't satisfy my desire to have a general way to unambiguously identify (and by implication, validate) tables (or any other similar module where you want to mix in your own element types).

That leaves the namespaced attribute.

In this approach, you copy the module's declarations into your schema but you declare an attribute that is in a module-specific namespace. For example, to identify OASIS exchange tables you could have this schema document:


<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" attributeFormDefault="qualified"
    targetNamespace="http://www.example.com/namespaces/oasis/exchange-table-model"
>

  <xs:attribute name="specVersion" type="xs:string" default="1.0"/>

</xs:schema>


And then use it like so:



<xs:import namespace="http://www.example.com/namespaces/oasis/exchange-table-model"
    schemaLocation="../oasis_exchange_table_model/oasis_exchange_table_model_1.0.xsd"/>

...

<xs:element name="entry">
  <xs:complexType>
    <xs:choice>
      <xs:any processContents="lax" namespace="##other" maxOccurs="unbounded"/>
    </xs:choice>
    <xs:attribute ref="tbl:specVersion"/>
  </xs:complexType>
</xs:element>


This now allows any OASIS-exchange-table-aware processor to first look to see if the namespace is used anywhere and then when it finds an element whose base name is one it recognizes (i.e., "tgroup") it can assume with confidence that it is in fact an OASIS table tgroup and not something else.
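
As a hedged sketch of what such recognition could look like on the processing side, here is a small XSLT 1.0 stylesheet that treats an element as an OASIS table cell only when it has the expected local name and carries the namespaced marker attribute. The output element name "recognized-oasis-entry" is invented for the example. Note that because the attribute is defaulted in the schema, a non-schema-aware transform will only see it if it is explicit in the instance or has been supplied by schema-aware parsing beforehand.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:tbl="http://www.example.com/namespaces/oasis/exchange-table-model"
    exclude-result-prefixes="tbl">

  <!-- Identity rule: copy everything through unchanged by default. -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- An element whose local name is "entry" (in any namespace) and that
       carries tbl:specVersion is treated as an OASIS table cell. -->
  <xsl:template match="*[local-name() = 'entry'][@tbl:specVersion]">
    <recognized-oasis-entry spec-version="{@tbl:specVersion}">
      <xsl:apply-templates select="node()"/>
    </recognized-oasis-entry>
  </xsl:template>
</xsl:stylesheet>
```

An element that is merely named "entry" but lacks the marker attribute falls through to the identity rule and gets no table treatment.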

This mechanism has some weaknesses: it still relies on the matching of local names, which means that you can't, for example, rename elements to something else (which you could do, for example, if you could simply subtype the base table types, which you can't because the same content model constraints apply there as apply to substitution groups). And it's not 100% foolproof. But it does at least provide a way to clearly assert, in the schema, that this element that is called "tgroup" is really intended to be an OASIS table tgroup and not something else. That is sufficient for a tool like an editor or content management system or generic formatter to apply OASIS table semantics to that element with reasonable certainty that it's the right thing to do.

Another approach would be to take the architecture approach, as used in HyTime and in DITA, where you use attributes to declare the base type of the element, i.e., something like this:


<tgroup tbl:oasisExchangeModel="tgroup">


This is the approach that XLink takes as well.

The problems with this approach include:

- It requires the ability to handle defaulted attributes if you want to avoid having all these attributes explicit in the source. This requires either the use of DTDs or schema-aware processing (for example, to feed an XSLT transform with post-schema-validation information that includes the defaulted attributes). DTDs are bad and you shouldn't use them. Setting up schema-aware processing isn't too hard if you can do a few lines of Java programming (or crib some existing code) but is something that has to be set up.

- There is no standard for identifying the attributes themselves as being mapping declaration attributes. In HyTime we had special processing instructions that declared what architectures were in use and what attributes were used to do the mapping. In DITA they define a specific attribute name and various convoluted (and in my opinion, misguided and unnecessary) rules for how to construct your DTD declarations. For anything else it would be an application-specific convention (as it is in XLink, for example).

- It shouldn't be necessary given an appropriate typing mechanism in your schema mechanism. Unfortunately the current XSD mechanism is too constraining. My understanding is that this is being addressed in a revision but I don't really know if it will be addressed completely. I think that any general solution has to give the schema author control over the nature of the constraints imposed on subtypes--certainly the current XSD-defined constraints are too strict for their use here (and for their use in implementing the equivalent of DITA's subtyping semantics).
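
To illustrate the first of these problems, here is a sketch of how the mapping attribute could be declared in XSD so that authors never type it. It assumes the imported schema declares a global attribute named oasisExchangeModel in the table namespace; the file name, the user guide namespace, and the placeholder tbody declaration are likewise hypothetical. The fixed value only reaches downstream code when validation actually runs and the post-schema-validation infoset is passed along.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Sketch only: assumes the imported schema declares a global attribute
     named oasisExchangeModel in the table namespace. -->
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           xmlns:tbl="http://www.example.com/namespaces/oasis/exchange-table-model"
           xmlns:my="http://example.com/namespaces/userguide"
           targetNamespace="http://example.com/namespaces/userguide"
           elementFormDefault="qualified">

  <xs:import namespace="http://www.example.com/namespaces/oasis/exchange-table-model"
             schemaLocation="oasis_exchange_table_model_1.0.xsd"/>

  <xs:element name="tgroup">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="my:tbody"/>
      </xs:sequence>
      <xs:attribute name="cols" type="xs:positiveInteger" use="required"/>
      <!-- A fixed value acts as a default: a schema-aware processor supplies
           it when the attribute is absent from the instance. -->
      <xs:attribute ref="tbl:oasisExchangeModel" fixed="tgroup"/>
    </xs:complexType>
  </xs:element>

  <!-- Placeholder so the sketch is complete; a real schema would model rows here. -->
  <xs:element name="tbody" type="xs:anyType"/>
</xs:schema>
```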

Of course for any of this to have meaning it would be necessary for the OASIS exchange table model specification to be updated to do the following:

- Define the namespace that means "OASIS exchange table model"

- Define a normative schema template for use in other schemas (it can't be a true module for all the reasons explained here, at least not today).

- Define one or more attributes that are in the OASIS exchange table model namespace in order to allow automatic recognition.

Note that for FO tables you have a similar issue in that there's no good way to extend the normative content models and since they are already in a namespace you couldn't just copy them into your schema's namespace. You'd have to create copies of the FO elements in your namespace and then map them back to the actual FO elements when you formatted the table. Not a big deal but really it shouldn't be necessary. Hmph.
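
The mapping step itself is small. A minimal XSLT 1.0 sketch, assuming the copied elements live in a made-up namespace (http://example.com/namespaces/fo-table-copy) and keep the same local names and attributes as their FO originals:

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:fo="http://www.w3.org/1999/XSL/Format"
    xmlns:myfo="http://example.com/namespaces/fo-table-copy">

  <!-- Rename each copied element back into the real FO namespace,
       keeping its local name and attributes. -->
  <xsl:template match="myfo:*">
    <xsl:element name="fo:{local-name()}"
                 namespace="http://www.w3.org/1999/XSL/Format">
      <xsl:copy-of select="@*"/>
      <xsl:apply-templates/>
    </xsl:element>
  </xsl:template>
</xsl:stylesheet>
```

Content inside the cells that comes from your own namespace would still need its own templates to produce fo:block and friends; this only covers the table scaffolding.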

Finally, note that all of this isn't an issue if there is no need to allow your namespace's elements within the context of standard stuff, such as when using MathML or SVG, where once you're in their domain you don't leave it. In that case you just import those schemas, allow the top-level elements (or subtype them in your namespace) and go on.

It's only an issue where you need to intermix elements from your namespace into elements from a used-by-reference namespace and want to impose constraints on that intermixing at the XSD schema level so that authors will get the appropriate guidance from the editor.
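
For the easy case, where you never leave the imported vocabulary once you are inside it, the schema side can be as simple as the following sketch. The MathML namespace name is the real one; the schemaLocation path, the target namespace, and the "equation" wrapper element are assumptions for the example, and the sketch assumes the imported schema declares a global "math" element.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           xmlns:m="http://www.w3.org/1998/Math/MathML"
           targetNamespace="http://example.com/namespaces/common"
           elementFormDefault="qualified">

  <!-- Use the MathML schema by reference; the local path is hypothetical. -->
  <xs:import namespace="http://www.w3.org/1998/Math/MathML"
             schemaLocation="mathml2/mathml2.xsd"/>

  <!-- Allow only the top-level math element; everything below it is
       governed entirely by the imported schema. -->
  <xs:element name="equation">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="m:math"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>
```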

And of course there are lots of other ways to impose the constraints but they all require customization which requires configuration and programming which is expensive.

It's also clear that the needs of documentation schemas, as opposed to more data-oriented schemas, were not well represented and/or well reflected in the development of the XSD schema spec. I suppose I could partly blame myself for that: I could have participated in that effort but at the time it seemed doomed to fail (I think I'm on record as predicting that they would never actually produce a working spec). But I didn't and they did produce a spec and here we are. At the time I thought they were both making it way more complicated than it needed to be and I thought it was overburdened with too many cooks representing too much the database world. The final result in fact seems to be just about as complicated as it needs to be, although there are certainly more things it needs to do. But just the having of it is a tremendous benefit to the XML community so I'm inclined to not complain too much about its flaws, although I would like to see this one addressed if possible.

7 Comments:

Anonymous Anonymous said...

If you want substitutions to work you need to derive the mytablecell element from the btd:entry element, in this case using xs:restriction.

And if you don't want people to be able to use the btd:entry element, you have to make it abstract.

5:15 AM  
Blogger Eliot Kimber said...

I had forgotten about abstract elements. That could address part of the problem. But I don't think there's any way to do the derivation because you simply can't restrict from <xs:any/> to a sequence or choice group, which is what I need to do.

Also, now that I think about it, doing substitution at the "entry" level wouldn't work for existing processors that expect to see an element named "entry" (that is, processors that are not schema type aware and that therefore cannot see that MyTableCell is a subtype of "entry").

That means I'd have to add an element inside of entry, i.e., "entry_contents" that would be the substitution point. This would allow existing processors to do the right thing but would add another level of containment inside the cell (albeit one that most XML editors would insert automatically).

7:08 AM  
Anonymous Anonymous said...

As far as I understand the spec it should be possible to restrict from xs:any to a choice group. But the Microsoft XSD parser doesn't allow it either.

1:16 PM  
Anonymous Anonymous said...

I don't think you need to define a content model for the table cell at all. If the table schema is supposed to be a generic reusable component, it seems logical that you need to configure the table for use in your own schema.

If you define tbl:entry to be of type: <xs:complexType name="entryType"/> you can extend it. To use the table, you create a new schema in the table namespace which redefines entryType by extension to add your own content model. Any defined attributes in the generic table model will be preserved.
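
A sketch of the redefinition described in this comment, under stated assumptions: table.xsd is a hypothetical file that declares the empty type entryType in the table namespace, and userguide.xsd is a hypothetical schema that declares a global "p" element. The redefining schema must have the same target namespace as the schema it redefines.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical wrapper schema; file names and namespaces are assumptions. -->
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           xmlns:tbl="http://example.com/namespaces/oasis/table_exchange_model"
           xmlns:my="http://example.com/namespaces/userguide"
           targetNamespace="http://example.com/namespaces/oasis/table_exchange_model"
           elementFormDefault="qualified">

  <xs:import namespace="http://example.com/namespaces/userguide"
             schemaLocation="userguide.xsd"/>

  <xs:redefine schemaLocation="table.xsd">
    <!-- Same name as the original type, derived from itself by extension. -->
    <xs:complexType name="entryType">
      <xs:complexContent>
        <xs:extension base="tbl:entryType">
          <xs:choice minOccurs="0" maxOccurs="unbounded">
            <xs:element ref="my:p"/>
          </xs:choice>
        </xs:extension>
      </xs:complexContent>
    </xs:complexType>
  </xs:redefine>
</xs:schema>
```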

5:29 AM  
Blogger Eliot Kimber said...

Extension is the key here: for some reason I didn't think of that.

If I give the type "entry" no content model then of course it can be extended in any way. Making "mytablecell" an extension of "entry" (or rather its type) and then putting it in the "entry" substitution group solves the problem.

I think I didn't think about extension because I wanted the table schema to be able to say explicitly what it expected for the content of the cell. But I see now that, at least as the XSD spec is currently formulated, you can't do that and allow for anything in the specialized element type's content.

I don't think that creating a new schema in the table namespace is correct--the whole point is that the schema defined for a given namespace is invariant--that's what makes it an object. If you want to modify it it has to be in a different namespace via subtyping (just as you would do in object-oriented code that provides APIs and base classes intended to be subclassed).

I think this approach using extension satisfies my requirements, but I'll have to do some more experimentation.
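
A sketch of what that could look like, pending the experimentation mentioned above. It assumes a hypothetical table.xsd that declares an empty complex type named entryType and an element "entry" of that type in the table namespace; the user guide namespace and the simple "p" element are also made up for the example.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Assumes table.xsd declares an empty complexType "entryType" and an
     element "entry" of that type. All names here are hypothetical. -->
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           xmlns:tbl="http://example.com/namespaces/oasis/table_exchange_model"
           xmlns:my="http://example.com/namespaces/userguide"
           targetNamespace="http://example.com/namespaces/userguide"
           elementFormDefault="qualified">

  <xs:import namespace="http://example.com/namespaces/oasis/table_exchange_model"
             schemaLocation="table.xsd"/>

  <xs:element name="p" type="xs:string"/>

  <!-- Substitutable wherever tbl:entry is allowed, because its type is
       derived from the head element's type by extension. -->
  <xs:element name="mytablecell" substitutionGroup="tbl:entry">
    <xs:complexType>
      <xs:complexContent>
        <xs:extension base="tbl:entryType">
          <xs:sequence>
            <xs:element ref="my:p" maxOccurs="unbounded"/>
          </xs:sequence>
        </xs:extension>
      </xs:complexContent>
    </xs:complexType>
  </xs:element>
</xs:schema>
```

Declaring tbl:entry as abstract, as suggested in the first comment, would additionally keep authors from using the bare base element, at the cost of hiding the name "entry" from processors that are not schema-type aware.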

7:08 AM  
Anonymous Anonymous said...

The problem with substitution groups is that an element can only be in one substitution group. If you have a table module and a list module, you'd want my:p to be the child of both the tbl:td and the lst:li elements, but you can't, as my:p would have to be part of both the tbl:td_contents and the lst:li_contents substitution group. Instead you need to create specialized my:p-in-td and my:p-in-li elements.

With redefinition of the tbl:td and lst:li types you don't have this problem. Less ideal from a modelling perspective, but nicer xml structures.

11:23 AM  
