Dr. Macro's XML Rants

NOTE TO TOOL OWNERS: In this blog I will occasionally make statements about products that you will take exception to. My intent is to always be factual and accurate. If I have made a statement that you consider to be incorrect or inaccurate, please bring it to my attention and, once I have verified my error, I will post the appropriate correction.

And before you get too exercised, please read the post, dated 9 Feb 2006, titled "All Tools Suck".

Monday, January 15, 2007

Edubuntu: Remarkably easy to set up and use

In the spirit of other bloggers in the XML space who have recently talked about their personal experiences with technology and their children and/or Linux, I thought I would mention my experience over the weekend setting up a computer for my daughter.

My daughter has just turned three and is starting to learn her letters and numbers and how to spell a few words (e.g., her name). I decided it was time to get her her own computer but being cheap I didn't want to go so far as to actually buy one, especially not when I have a veritable scrapyard of old PCs and parts at home.

As it happens, we moved the Austin office of Innodata to new space last week and as a side effect I got to take home an ancient dual-proc PIII machine. So I decided yesterday, a cold rainy day, to try to build an Edubuntu machine. Edubuntu is a configuration of Ubuntu Linux specially designed for kids and classroom use. It comes with a number of educational applications and games, including Tuxpaint, which is perfect for Dada as she learns to use the mouse and keyboard. There are some nice little learn-to-use-the-keyboard-and-mouse games as well.

I also had an LCD display that I wasn't using (in our new house there's really no need for a dedicated desktop and we don't really need or want docking stations for our laptops so the display was only being used as a console for the network firewall machine, which I needed maybe twice a year).

The machine (which had been named "Doublebot" back when it was a development support box) wouldn't come on so I pulled the power supply out of my old game machine desktop [an AMD box I built some years ago--it had gotten flaky but by that time I was in the process of becoming a parent and long hours of gaming in a room by myself were not really relevant to me any more] and slapped it into Doublebot, along with a wireless PCI card and the not-quite-as-ancient video card from the old game machine. During this time I was also downloading the bootable CD image for Edubuntu. It did take me a while to figure out how to cable up the various drives but I did eventually get all the jumpers set right and the cables hooked up correctly. Finally the machine got to the point where it was correctly recognizing the drives and trying to boot from them (the hard drive in the machine didn't have a usable operating system on it).

By the time I got the hardware going the CD image had downloaded and I burned it to a disk. Popped the disk in the drive and it booted right up. The network connection worked, the screen resolution was correct, all the devices were recognized. It just worked. Then I just selected the "install" option and it put itself on the disk drive--I didn't have to do anything beyond select my language and keyboard layout. I let it set up the disk partition for me (I've spent so many hours over the last 10 years or so configuring disk partitions, hours that I'll never get back). I ran the software update, which updated everything to the latest versions, added a few more packages that I wanted, and verified that all the kid stuff worked.

I put the covers back on and set it up in the living room on Dada's little table. Booted it up and showed her how to log in (since she can spell her name she can log in herself, although she is still getting used to seeing dots instead of letters when she puts in her password). She easily spent three hours yesterday playing with Tuxpaint. She got the basic mouse skills remarkably quickly, given that she'd never really used a mouse before, although she still needs help with selecting stuff (and she can't read the message boxes that come up when she accidentally clicks on things like "save" or "exit"). She can also use Tuxpaint to type words, which she likes to do.

I can't tell you how many times I've installed Linux or Windows over the years and this was by far and away the easiest it's ever been--I don't think it could have been any easier unless it had just magically appeared on the hard drive without any physical intervention from me. Of course I was using a very old computer with fairly old components (the newest part was probably the wireless PCI card and that was at least two years old), so it's no surprise that there were no driver problems or anything, but just the fit and finish was so much better than I've ever seen from a Linux distribution before. I also liked the window environment (I assume it's KDE but I really don't know what it is), partly because it's very close to Windows, which means it looks and behaves like I expect it to.

The only other thing I did was install secure shell so I could connect to the machine remotely (using Cygwin and Cygwin X11 under Windows) and that was as easy as could be using the Synaptic package manager (of course, I did know what I was doing at that point, having configured a few Linux boxes in my day).

I would like to see more games and applications for pre-literate children, but I know that that's a lot to ask of the open source community. But I would be willing to pay a fair price for applications that run under Linux (just as I would for Windows-based apps).

Coupled with the latest versions of Open Office, which seems to finally be able to really handle MS Office stuff completely enough, it might be time to take another look at going to Linux (something I did some years ago but finally got beaten down, in particular by the lack of a version of Arbortext Editor that would run on Linux, back when Arbortext Editor was central to a lot of my work as an integrator, as well as a change in the pricing for VMware, which enabled running Windows in a virtual machine).



Friday, January 05, 2007

Specializing xi:include

I've posted before about how useful it is to specialize the XInclude include element--it makes authoring easier, it lets you define constraints on what can be referenced, etc.

But until now I'd not really appreciated another serious benefit: It avoids ambiguous content models.

I ran into this in the process of modifying the DocBook 5.0RC1 XSD schemas to add xincludes. The obvious approach of just adding xi:include wherever something that could be included is allowed did not work because it created all sorts of ambiguity problems. Doh!

Consider this content model from DocBook schemas (somewhat modified by me for my local use):
<xs:sequence>
  <xs:choice minOccurs="0" maxOccurs="unbounded">
    <xs:element ref="docbook:glossary"/>
    <xs:element ref="docbook:bibliography"/>
    <xs:element ref="docbook:index"/>
    <xs:element ref="docbook:toc"/>
  </xs:choice>
  <xs:choice>
    <xs:sequence>
      <xs:group ref="dbparms:all_blocks" maxOccurs="unbounded"/>
      <xs:element minOccurs="0" maxOccurs="unbounded" ref="docbook:section"/>
    </xs:sequence>
    <xs:element maxOccurs="unbounded" ref="docbook:section"/>
  </xs:choice>
  <xs:choice minOccurs="0" maxOccurs="unbounded">
    <xs:element ref="docbook:glossary"/>
    <xs:element ref="docbook:bibliography"/>
    <xs:element ref="docbook:index"/>
    <xs:element ref="docbook:toc"/>
  </xs:choice>
</xs:sequence>

The intuitive thing would be to allow xi:include in each place where section or section-like things are allowed.

But this creates a horribly ambiguous content model. Now I happen to think that the ambiguity rules are completely bogus; nevertheless, having chosen to live in XSD land I'm stuck with them (at least for now).

But it should be immediately obvious that if we specialize xi:include to reflect the specific element types of the things we want to include, for example docbook:section_include, then the ambiguity problem goes away because you'll be adding tokens with the same distinction as the existing tokens, so you can never create an ambiguity that wasn't already there.
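
A minimal sketch of what such a specialized include could look like. It assumes that the XInclude schema in use (xinclude.xsd here) declares a named complex type called xi:includeType; the stub declaration of section and the group name section_or_include are hypothetical stand-ins for illustration, not part of the DocBook schemas.

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           xmlns:xi="http://www.w3.org/2001/XInclude"
           xmlns:docbook="http://docbook.org/ns/docbook"
           targetNamespace="http://docbook.org/ns/docbook"
           elementFormDefault="qualified">

  <!-- Assumption: xinclude.xsd declares the named type xi:includeType -->
  <xs:import namespace="http://www.w3.org/2001/XInclude"
             schemaLocation="xinclude.xsd"/>

  <!-- Stand-in for the real docbook:section declaration -->
  <xs:element name="section" type="xs:anyType"/>

  <!-- Specialized include: same type as xi:include, but its own name,
       so it is as distinguishable from other tokens as section itself -->
  <xs:element name="section_include" type="xi:includeType"/>

  <!-- Hypothetical group: wherever a section may occur, so may its include -->
  <xs:group name="section_or_include">
    <xs:choice>
      <xs:element ref="docbook:section"/>
      <xs:element ref="docbook:section_include"/>
    </xs:choice>
  </xs:group>

</xs:schema>
```

A content model would then refer to the group (or to the two elements) in each place where it refers to docbook:section today.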

I also observe that since xi:include's complex type is named, you can do the specialization formally using substitution groups at the XSD level. Hmmm.
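
As a sketch only, the substitution-group form of that idea might be declared like this. It assumes xinclude.xsd declares xi:include with the named type xi:includeType; the element name docbook:section_include is the illustrative one used above.

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           xmlns:xi="http://www.w3.org/2001/XInclude"
           targetNamespace="http://docbook.org/ns/docbook"
           elementFormDefault="qualified">

  <!-- Assumption: xi:include is declared here with type xi:includeType -->
  <xs:import namespace="http://www.w3.org/2001/XInclude"
             schemaLocation="xinclude.xsd"/>

  <!-- Same type as the head element, so the substitution is allowed;
       the declaration itself records that this is a kind of xi:include -->
  <xs:element name="section_include"
              type="xi:includeType"
              substitutionGroup="xi:include"/>

</xs:schema>
```

One consequence to keep in mind: a member of a substitution group is also accepted wherever the head element is referenced, so section_include would become valid anywhere a content model refers to xi:include directly.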


DocBook, Schemas, and Customization

Back in March I was experimenting with trying to put the CALS table element types into their own namespace and then using those types from the context of a different namespace, but with elements from the using namespace allowed in the content of table cells. That led me to ask how best to do it on the XML Schema developer list:

If I understand the result of that discussion, the solution was to define an element type that is allowed within the cell element that is the head of a substitution group. You make that element abstract so that it can't itself be used in instances. To customize the elements allowed within the table cell you create your own subtype of the abstract element and give it whatever content model you want and put your type in the substitution group of the base type. Whew.
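
Here is a rough sketch of that arrangement. The namespaces (urn:example:cals, urn:example:mydoc), the file name cals.xsd, and the element names entryContent, cellBody, and para are all made up for illustration; only the pattern itself comes from the discussion. First the table schema, with the abstract head:

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           xmlns:cals="urn:example:cals"
           targetNamespace="urn:example:cals"
           elementFormDefault="qualified">

  <!-- Abstract head: never appears in instances, only its substitutes do -->
  <xs:element name="entryContent" abstract="true" type="xs:anyType"/>

  <!-- The table cell allows whatever substitutes for the head -->
  <xs:element name="entry">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="cals:entryContent"
                    minOccurs="0" maxOccurs="unbounded"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>

</xs:schema>
```

Then the using schema, which supplies its own cell content:

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           xmlns:cals="urn:example:cals"
           xmlns:my="urn:example:mydoc"
           targetNamespace="urn:example:mydoc"
           elementFormDefault="qualified">

  <xs:import namespace="urn:example:cals" schemaLocation="cals.xsd"/>

  <xs:element name="para" type="xs:string"/>

  <!-- Local substitute for the abstract head, with a local content model -->
  <xs:element name="cellBody" substitutionGroup="cals:entryContent">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="my:para" maxOccurs="unbounded"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>

</xs:schema>
```

In an instance the nesting comes out as cals:entry, then my:cellBody, then my:para: the cell keeps its name, and the local content sits one level down inside the substitute element.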

This works, but has the (potential) problem that it requires an extra, otherwise unnecessary, level of containment if you want the name of the cell element to be invariant (because, for example, you have processors that can't look at its type hierarchy, which would be the normal case today [because as far as I know the only tool that supports type-aware XSLT processing is the for-money version of Saxon]).

If you don't care about the cell type name then you can just make it the head of the substitution group. I suppose you could have the local name be invariant, which is a bit of a hack but would probably mostly work or at least could be made to work easily in XSLTs, such as the ubiquitous DocBook CALS table processing code and all of its many derivatives.

Fine, so that works for tables, but what about a more general case, like DocBook itself?

I've started the exercise/experiment of using the new DocBook 5.0RC1 XSD schemas as a base for creating a customized doctype. As part of this customization I want to remove unneeded elements, add new elements, and generally modify content models here and there.

My first approach was completely brute force: I just copied the DocBook declarations, changed the namespace to my own, and modified things as I needed to. This is essentially the same thing as you would do before version 5, where there is no namespace (and therefore no clear way to distinguish core DocBook constructs from your customizations at the name level).

This was easy enough to do (once I factored out the appropriate groups, which were not in the generated XSD schemas) but it's not very satisfying:

- The elements that come straight from DocBook are not in the DocBook namespace, so processors that actually look at the namespace and expect it to be docbook (that is, processors that don't just look at the local names), will fail to recognize my DocBook elements as DocBook.

- There's still no distinction between the base DocBook elements and new element types I've added.

- Reacting to new versions of DocBook will be difficult and tedious because I'll have to manually copy changes from the base DocBook schema to my schema.

- It sort of misses the point of having a parameterized set of element types that are designed to be refined and extended.

[NOTE: telling me to use the RelaxNG versions of the schemas is not an option. See my earlier post on RelaxNG and schemas.]

What I'd like to do is from my top-level schema in my namespace configure the groups used in the various content models to reflect both my removal of unneeded elements from the core DocBook declarations and my addition of new elements in my namespace (keeping them clearly distinct from the base DocBook elements). I'd also like to, as appropriate, use DocBook types as the base for restriction (unfortunately, extension in XSD schemas is essentially useless since you can only add things to the end of content models, you can't do the equivalent of Relax's "interleave").

So my next experiment is to pull the groups out into a separate namespace. This results in a separate XSD document that is then intended to be copied and modified by the using top-level schema in order to modify the content models as needed. I've done this far enough to let me both add my own element types from my namespace and customize the content models.

This results in a system of two XSD files for DocBook as distributed (not counting the little ancillary XSDs like xml.xsd and xinclude.xsd):

- docbook_parms.xsd -- Contains all the attribute sets and groups. Imports docbook.xsd.

- docbook.xsd -- The base DocBook declarations, imports docbook_parms.xsd
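
To make the split concrete, here is a cut-down sketch of what a docbook_parms.xsd along these lines could contain. The group name all_blocks is the one used in the content model shown in the xi:include post above; the parameter namespace URI (urn:example:docbook-parms) and the three group members are assumptions for illustration, and the real group would list many more elements.

```xml
<!-- docbook_parms.xsd: groups and attribute sets only, no element declarations -->
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           xmlns:docbook="http://docbook.org/ns/docbook"
           xmlns:dbparms="urn:example:docbook-parms"
           targetNamespace="urn:example:docbook-parms">

  <!-- The groups refer to elements declared in the base schema -->
  <xs:import namespace="http://docbook.org/ns/docbook"
             schemaLocation="docbook.xsd"/>

  <!-- Referenced from docbook.xsd as
       <xs:group ref="dbparms:all_blocks" maxOccurs="unbounded"/> -->
  <xs:group name="all_blocks">
    <xs:choice>
      <xs:element ref="docbook:para"/>
      <xs:element ref="docbook:itemizedlist"/>
      <xs:element ref="docbook:orderedlist"/>
    </xs:choice>
  </xs:group>

</xs:schema>
```

On the other side, docbook.xsd carries a matching xs:import of the parameter namespace, which is what makes the two files depend on each other.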

To create a custom DocBook-based DTD I do the following:

1. Copy docbook_parms.xsd to myschema_docbook_parms.xsd and add to it an import of my schema (myschema.xsd)

2. Modify (or copy) docbook.xsd and change the existing import of docbook_parms.xsd to instead point to myschema_docbook_parms.xsd.

3. Create myschema.xsd that imports both docbook.xsd and myschema_docbook_parms.xsd. This schema declares any new element types I need (in its own namespace).

4. Modify the groups in myschema_docbook_parms.xsd as needed to reflect my desired changes.
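
As a sketch of where steps 1 and 4 end up, a customized parameter file might look something like this. The namespace URIs (urn:example:docbook-parms, urn:example:myschema) and the element my:callout_box are hypothetical; the file names are the ones from the steps above.

```xml
<!-- myschema_docbook_parms.xsd: local copy of docbook_parms.xsd -->
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           xmlns:docbook="http://docbook.org/ns/docbook"
           xmlns:my="urn:example:myschema"
           targetNamespace="urn:example:docbook-parms">

  <xs:import namespace="http://docbook.org/ns/docbook"
             schemaLocation="docbook.xsd"/>

  <!-- Step 1: the added import of my own schema -->
  <xs:import namespace="urn:example:myschema"
             schemaLocation="myschema.xsd"/>

  <!-- Step 4: the group keeps its name, but its membership is now local;
       a base element I don't need is simply left out of the choice -->
  <xs:group name="all_blocks">
    <xs:choice>
      <xs:element ref="docbook:para"/>
      <xs:element ref="docbook:itemizedlist"/>
      <!-- New element from my namespace (hypothetical name) -->
      <xs:element ref="my:callout_box"/>
    </xs:choice>
  </xs:group>

</xs:schema>
```

The target namespace stays the same as in the original parameter file, so the group references inside docbook.xsd resolve without any change beyond the import location.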

This feels better but it's still not completely satisfactory. In particular, it requires that you still modify the base DocBook schema in order to change the URL on the import of the parameter file. But that's it--otherwise the base XSD is unmodified and my local element types are in their own namespace. It would be really nice if you could do something like a substitution group but with groups instead of element types--I think that would be much closer to being a replacement for parameter entities than XSD substitution groups are.

Unfortunately, the DocBook XSDs as currently supplied don't make this very easy. For a complete solution you'd want to have a group for every element that has a unique content model. These groups would then make it easy to locally tweak the content models as needed without having to do anything to the original declarations. Also, there are elements that are clearly subtypes of general types (e.g., chapter and appendix are both instances of an [undefined] "ChapterDivision") and it would be useful to have these types actually declared.
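
For example, a declared general type might look roughly like the fragment below. This is purely a sketch of the idea: the type name ChapterDivision is the undefined one mentioned above, the content model is a simplified guess rather than DocBook's actual chapter model, and the parameter namespace URI and file name are assumptions.

```xml
<!-- Fragment of a hypothetical docbook.xsd; declarations of title, section,
     and the other elements are omitted here -->
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           xmlns:docbook="http://docbook.org/ns/docbook"
           xmlns:dbparms="urn:example:docbook-parms"
           targetNamespace="http://docbook.org/ns/docbook"
           elementFormDefault="qualified">

  <xs:import namespace="urn:example:docbook-parms"
             schemaLocation="docbook_parms.xsd"/>

  <!-- The general type, declared once (simplified content model) -->
  <xs:complexType name="ChapterDivision">
    <xs:sequence>
      <xs:element ref="docbook:title"/>
      <xs:group ref="dbparms:all_blocks"
                minOccurs="0" maxOccurs="unbounded"/>
      <xs:element ref="docbook:section"
                  minOccurs="0" maxOccurs="unbounded"/>
    </xs:sequence>
  </xs:complexType>

  <!-- Both elements are visibly instances of the same type -->
  <xs:element name="chapter" type="docbook:ChapterDivision"/>
  <xs:element name="appendix" type="docbook:ChapterDivision"/>

</xs:schema>
```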

So this works and it feels much closer to what I think the real intent of DocBook's customization mechanism always was (even though the reality was that you were just making syntactic changes to a copy of the original DTD declarations).

But I'm wondering if I've missed an easier way to do it? I don't think so because substitution groups won't work in this case (XSD's rules for what can substitute for what are too restrictive, at least with XSD 1.0, and raise the invariant name problem described above). But I can't claim to be an XSD wizard so it's quite possible I've missed something.

In any case, this approach does address what has historically been one of my big complaints about DocBook: until now there was no way, looking at a given document instance, to know what parts of it were base DocBook and which were local modifications, without doing some sort of tedious inspection against the base DocBook declaration set--there was nothing in either the document instance or its local declaration set that told you what was and wasn't DocBook. This was because DocBook had no defined mechanism for classifying things as being or not being from DocBook (e.g., something like DITA's class= attribute or HyTime's architectural form mechanism). Namespaces do give you this, as long as you respect the namespace and don't add your own element types to the DocBook namespace (which of course you could do and again the only way to detect it would be a comparison of your declarations with the base DocBook declarations). But if you respect the namespace then distinctions are clear.

So for now I'm satisfied with this approach. We'll see how I feel after I've done a bit more work with the stuff I'm working on....