Dr. Macro's XML Rants

NOTE TO TOOL OWNERS: In this blog I will occasionally make statements about products that you will take exception to. My intent is to always be factual and accurate. If I have made a statement that you consider to be incorrect or inaccurate, please bring it to my attention and, once I have verified my error, I will post the appropriate correction.

And before you get too exercised, please read the post, dated 9 Feb 2006, titled "All Tools Suck".

Friday, January 05, 2007

DocBook, Schemas, and Customization

Back in March I was experimenting with putting the CALS table element types into their own namespace and then using those types from the context of a different namespace, but with elements from the using namespace allowed in the content of table cells. That led me to ask on the XML Schema developer list how best to do it.

If I understand the result of that discussion, the solution was to define an element type, allowed within the cell element, that is the head of a substitution group. You make that element abstract so that it can't itself be used in instances. To customize the elements allowed within the table cell you create your own subtype of the abstract element, give it whatever content model you want, and put your type in the substitution group of the base type. Whew.
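To make the pattern concrete, here is a minimal sketch. The namespaces (http://example.org/ns/tables and http://example.org/ns/mydoc), the file name tables.xsd, and the element names cellContent, cellBody, and para are all made up for illustration; they are not taken from the CALS or DocBook schemas or from the list discussion.

```xml
<!-- tables.xsd: hypothetical table namespace -->
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           xmlns:tbl="http://example.org/ns/tables"
           targetNamespace="http://example.org/ns/tables"
           elementFormDefault="qualified">

  <!-- Abstract head of the substitution group: it can never appear
       in an instance itself, only its substitutes can. -->
  <xs:element name="cellContent" abstract="true"/>

  <!-- The cell element keeps an invariant name and allows only the head. -->
  <xs:element name="entry">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="tbl:cellContent" minOccurs="0" maxOccurs="unbounded"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>
```

```xml
<!-- mydoc.xsd: hypothetical using namespace -->
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           xmlns:tbl="http://example.org/ns/tables"
           xmlns:my="http://example.org/ns/mydoc"
           targetNamespace="http://example.org/ns/mydoc"
           elementFormDefault="qualified">
  <xs:import namespace="http://example.org/ns/tables"
             schemaLocation="tables.xsd"/>

  <!-- A local element that may stand in wherever tbl:cellContent is allowed. -->
  <xs:element name="cellBody" substitutionGroup="tbl:cellContent">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="my:para" minOccurs="0" maxOccurs="unbounded"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>

  <xs:element name="para" type="xs:string"/>
</xs:schema>
```

In an instance a cell would then read tbl:entry, then my:cellBody, then my:para, with my:cellBody acting as a wrapper between the cell and its real content.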

This works, but has the (potential) problem that it requires an extra, otherwise unnecessary, level of containment if you want the name of the cell element to be invariant (because, for example, you have processors that can't look at its type hierarchy, which would be the normal case today [because as far as I know the only tool that supports type-aware XSLT processing is the for-money version of Saxon]).

If you don't care about the cell type name then you can just make it the head of the substitution group. I suppose you could have the local name be invariant, which is a bit of a hack but would probably mostly work or at least could be made to work easily in XSLTs, such as the ubiquitous DocBook CALS table processing code and all of its many derivatives.

Fine, so that works for tables, but what about a more general case, like DocBook itself?

I've started the exercise/experiment of using the new DocBook 5.0RC1 XSD schemas as a base for creating a customized doctype. As part of this customization I want to remove unneeded elements, add new elements, and generally modify content models here and there.

My first approach was completely brute force: I just copied the DocBook declarations, changed the namespace to my own, and modified things as I needed to. This is essentially the same thing you would do before version 5, where there is no namespace (and therefore no clear way to distinguish core DocBook constructs from your customizations at the name level).

This was easy enough to do (once I factored out the appropriate groups, which were not in the generated XSD schemas) but it's not very satisfying:

- The elements that come straight from DocBook are not in the DocBook namespace, so processors that actually look at the namespace and expect it to be docbook (that is, processors that don't just look at the local names), will fail to recognize my DocBook elements as DocBook.

- There's still no distinction between the base DocBook elements and new element types I've added.

- Reacting to new versions of DocBook will be difficult and tedious because I'll have to manually copy changes from the base DocBook schema to my schema.

- It sort of misses the point of having a parameterized set of element types that are designed to be refined and extended.

[NOTE: telling me to use the RelaxNG versions of the schemas is not an option. See my earlier post on RelaxNG and schemas.]

What I'd like to do, from my top-level schema in my namespace, is configure the groups used in the various content models to reflect both my removal of unneeded elements from the core DocBook declarations and my addition of new elements in my namespace (keeping them clearly distinct from the base DocBook elements). I'd also like to use DocBook types as the base for restriction where appropriate. (Unfortunately, extension in XSD schemas is essentially useless here, since you can only add things to the end of content models; you can't do the equivalent of RelaxNG's "interleave".)
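The limitation on extension is easy to see in a small sketch. The type names BaseSection and ExtendedSection and their content are hypothetical stand-ins, not DocBook declarations:

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

  <!-- Hypothetical base type: a title followed by one or more paras. -->
  <xs:complexType name="BaseSection">
    <xs:sequence>
      <xs:element name="title" type="xs:string"/>
      <xs:element name="para" type="xs:string" maxOccurs="unbounded"/>
    </xs:sequence>
  </xs:complexType>

  <!-- Extension: the new particle is appended after the base content. -->
  <xs:complexType name="ExtendedSection">
    <xs:complexContent>
      <xs:extension base="BaseSection">
        <xs:sequence>
          <xs:element name="note" type="xs:string" minOccurs="0"/>
        </xs:sequence>
      </xs:extension>
    </xs:complexContent>
  </xs:complexType>
</xs:schema>
```

The effective content model of ExtendedSection is the base sequence followed by the added one: title, then one or more para, then an optional note. Extension gives no way to allow note between title and the first para, or mixed in among the paras.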

So my next experiment is to pull the groups out into a separate namespace. This results in a separate XSD document that is then intended to be copied and modified by the using top-level schema in order to modify the content models as needed. I've done this far enough to let me both add my own element types from my namespace and customize the content models.

This results in a system of two XSD files for DocBook as distributed (not counting the little ancillary XSDs like xml.xsd and xinclude.xsd):

- docbook_parms.xsd -- Contains all the attribute sets and groups. Imports docbook.xsd.

- docbook.xsd -- The base DocBook declarations. Imports docbook_parms.xsd.
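A rough sketch of how the two files might refer to each other follows. The DocBook 5 namespace URI is the real one; the parameter namespace (http://example.org/ns/docbook-parms), the group name para.inlines, the attribute group name common.attributes, and the drastically simplified content models are assumptions made for illustration only.

```xml
<!-- docbook_parms.xsd: groups and attribute sets in their own namespace -->
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           xmlns:db="http://docbook.org/ns/docbook"
           targetNamespace="http://example.org/ns/docbook-parms">
  <xs:import namespace="http://docbook.org/ns/docbook"
             schemaLocation="docbook.xsd"/>

  <!-- Hypothetical attribute set shared by the element declarations. -->
  <xs:attributeGroup name="common.attributes">
    <xs:attribute name="role" type="xs:string"/>
  </xs:attributeGroup>

  <!-- Hypothetical group listing the inlines allowed in a para. -->
  <xs:group name="para.inlines">
    <xs:choice>
      <xs:element ref="db:emphasis"/>
      <xs:element ref="db:link"/>
    </xs:choice>
  </xs:group>
</xs:schema>
```

```xml
<!-- docbook.xsd: element declarations that refer to the groups -->
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           xmlns:db="http://docbook.org/ns/docbook"
           xmlns:parms="http://example.org/ns/docbook-parms"
           targetNamespace="http://docbook.org/ns/docbook"
           elementFormDefault="qualified">
  <xs:import namespace="http://example.org/ns/docbook-parms"
             schemaLocation="docbook_parms.xsd"/>

  <xs:element name="para">
    <xs:complexType mixed="true">
      <xs:group ref="parms:para.inlines" minOccurs="0" maxOccurs="unbounded"/>
      <xs:attributeGroup ref="parms:common.attributes"/>
    </xs:complexType>
  </xs:element>

  <!-- Placeholder declarations so the sketch is complete. -->
  <xs:element name="emphasis" type="xs:string"/>
  <xs:element name="link" type="xs:string"/>
</xs:schema>
```

The imports are circular, which XSD permits: the groups refer to DocBook elements, and the DocBook content models refer to the groups.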

To create a custom DocBook-based doctype I do the following:

1. Copy docbook_parms.xsd to myschema_docbook_parms.xsd and add to it an import of my schema (myschema.xsd)

2. Modify (or copy) docbook.xsd and change the existing import of docbook_parms.xsd to instead point to myschema_docbook_parms.xsd.

3. Create myschema.xsd that imports both docbook.xsd and myschema_docbook_parms.xsd. This schema declares any new element types I need (in its own namespace).

4. Modify the groups in myschema_docbook_parms.xsd as needed to reflect my desired changes.
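The steps above can be sketched as follows. The file names come from the steps; everything else (the parameter namespace http://example.org/ns/docbook-parms, the local namespace http://example.org/ns/myschema, the group para.inlines, and the element partNumber) is hypothetical.

```xml
<!-- myschema_docbook_parms.xsd: the copied parameter file (steps 1 and 4) -->
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           xmlns:db="http://docbook.org/ns/docbook"
           xmlns:my="http://example.org/ns/myschema"
           targetNamespace="http://example.org/ns/docbook-parms">
  <xs:import namespace="http://docbook.org/ns/docbook"
             schemaLocation="docbook.xsd"/>
  <!-- Step 1: the added import of the local schema. -->
  <xs:import namespace="http://example.org/ns/myschema"
             schemaLocation="myschema.xsd"/>

  <!-- Step 4: a group edited to drop one DocBook element (say db:link)
       and to allow a new element from the local namespace. -->
  <xs:group name="para.inlines">
    <xs:choice>
      <xs:element ref="db:emphasis"/>
      <xs:element ref="my:partNumber"/>
    </xs:choice>
  </xs:group>
</xs:schema>
```

```xml
<!-- myschema.xsd: the top-level schema (step 3) -->
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           targetNamespace="http://example.org/ns/myschema"
           elementFormDefault="qualified">
  <xs:import namespace="http://docbook.org/ns/docbook"
             schemaLocation="docbook.xsd"/>
  <xs:import namespace="http://example.org/ns/docbook-parms"
             schemaLocation="myschema_docbook_parms.xsd"/>

  <!-- A new element type, clearly outside the DocBook namespace. -->
  <xs:element name="partNumber" type="xs:token"/>
</xs:schema>
```

Step 2 is then a one-attribute change in docbook.xsd: the schemaLocation on its import of the parameter namespace becomes myschema_docbook_parms.xsd.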

This feels better but it's still not completely satisfactory. In particular, it requires that you still modify the base DocBook schema in order to change the URL on the import of the parameter file. But that's it--otherwise the base XSD is unmodified and my local element types are in their own namespace. It would be really nice if you could do something like a substitution group but with groups instead of element types--I think that would be much closer to being a replacement for parameter entities than XSD substitution groups are.

Unfortunately, the DocBook XSDs as currently supplied don't make this very easy. For a complete solution you'd want to have a group for every element that has a unique content model. These groups would then make it easy to locally tweak the content models as needed without having to do anything to the original declarations. Also, there are elements that are clearly subtypes of general types (e.g., chapter and appendix are both instances of an [undefined] "ChapterDivision") and it would be useful to have these types actually declared.
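As an illustration of what such declarations could look like, here is a small sketch. The type name ChapterDivision comes from the paragraph above; the group name division.content and the simplified content model are assumptions, and for brevity the group sits in the same file rather than in a separate parameter namespace.

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           xmlns:db="http://docbook.org/ns/docbook"
           targetNamespace="http://docbook.org/ns/docbook"
           elementFormDefault="qualified">

  <!-- One named group per distinct content model, so a customizer
       can adjust the group without touching the declarations. -->
  <xs:group name="division.content">
    <xs:choice>
      <xs:element ref="db:para"/>
    </xs:choice>
  </xs:group>

  <!-- The general type that chapter and appendix both instantiate. -->
  <xs:complexType name="ChapterDivision">
    <xs:sequence>
      <xs:element ref="db:title"/>
      <xs:group ref="db:division.content" minOccurs="0" maxOccurs="unbounded"/>
    </xs:sequence>
  </xs:complexType>

  <xs:element name="chapter" type="db:ChapterDivision"/>
  <xs:element name="appendix" type="db:ChapterDivision"/>

  <!-- Placeholder declarations so the sketch is complete. -->
  <xs:element name="title" type="xs:string"/>
  <xs:element name="para" type="xs:string"/>
</xs:schema>
```

With a named type in place, a local schema could also derive its own division types from ChapterDivision by restriction.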

So this works and it feels much closer to what I think the real intent of DocBook's customization mechanism always was (even though the reality was that you were just making syntactic changes to a copy of the original DTD declarations).

But I'm wondering if I've missed an easier way to do it. I don't think so, because substitution groups won't work in this case (XSD's rules for what can substitute for what are too restrictive, at least with XSD 1.0, and raise the invariant name problem described above). But I can't claim to be an XSD wizard so it's quite possible I've missed something.

In any case, this approach does address what has historically been one of my big complaints about DocBook: until now there was no way, looking at a given document instance, to know what parts of it were base DocBook and which were local modifications, without doing some sort of tedious inspection against the base DocBook declaration set--there was nothing in either the document instance or its local declaration set that told you what was and wasn't DocBook. This was because DocBook had no defined mechanism for classifying things as being or not being from DocBook (e.g., something like DITA's class= attribute or HyTime's architectural form mechanism). Namespaces do give you this, as long as you respect the namespace and don't add your own element types to the DocBook namespace (which of course you could do, and again the only way to detect it would be a comparison of your declarations with the base DocBook declarations). But if you respect the namespace then the distinctions are clear.

So for now I'm satisfied with this approach. We'll see how I feel after I've done a bit more work with the stuff I'm working on....



John Cowan said...

How to box with one hand tied behind your back, eh?

1:25 AM

Anonymous said...

Hi Eliot,
I agree with John ;-)

It is much easier to create your customization in RELAX NG and then generate WXS from this customized RELAX NG schema if you really need WXS.

The DocBook RELAX NG schema is damn easy to customize; it was designed with this in mind. Other DocBook schemas like WXS and DTD are just automatically generated and not well suited for customization.


4:12 PM  
