Dr. Macro's XML Rants

NOTE TO TOOL OWNERS: In this blog I will occasionally make statements about products that you will take exception to. My intent is to always be factual and accurate. If I have made a statement that you consider to be incorrect or inaccurate, please bring it to my attention and, once I have verified my error, I will post the appropriate correction.

And before you get too exercised, please read the post, dated 9 Feb 2006, titled "All Tools Suck".

Tuesday, April 24, 2007

DITA Standard Practice: Always Make Local Shells

As I started doing serious work with DITA and in particular implementing specializations, it became clear to me that the first thing anyone using DITA should do is make local copies of all the DITA-provided shell DTDs or schemas. This should just be automatic. [Note: I'm going to use the term "DTD" to mean "DTD or schema" from now on.]

Why?

DITA, like DocBook, is a generic standard designed specifically to allow controlled local configuration and extension. Anyone who uses DITA in any non-trivial way will need to do at least some configuration, if not specialization, before too long. It is the rare user of DITA who genuinely needs to use all of the topic types and domains reflected in the base DITA distribution. Even if you're not using any specializations you probably only need some of the domains DITA provides out of the box.

By "configuration" I mean adjusting the set of topic types and domains that are or are not used for a given document set. For example, if you're not documenting interactive software you probably have no use for the User Interface domain and would just as soon not have those element types available to authors. Turning off that domain for your authors is "configuration".

By "specialization" I mean new domains or topic types derived from the base DITA-defined types (see my specialization tutorial for a deeper discussion). Even if you don't develop your own specializations, it is likely that you will use specializations developed by others. This will be increasingly likely as the DITA community begins to develop more and more special-purpose specializations--this is one of the really cool things about DITA--it enables and rewards the creation of "plug-ins" that are relatively easy to create, distribute, and integrate with the base DITA document types and supporting infrastructure.

In order to do configuration, or to use specializations, you must create local shell DTDs that reflect the local configuration or integrate the specializations.

Since you're going to do it sooner or later, you might as well start your DITA life there and be prepared. Eat the (minor) pain up front of configuring your local environment to use your local shells and then you're set to go.

If you set up your local shells first, then as you add new DITA-aware tools to your system you can simply configure them to use your shells from the get-go. The alternative is to build a system of tools and a set of documents that all have to be reconfigured later, when you finally do implement local shells. Or worse, you discover that your system has become such a lava flow that you can't reconfigure it at all, meaning you can't do any configuration or use new specializations because the cost of reconfiguration would be too high or too risky.

NOTE: when you create local shells you must give them unique global identifiers (URIs or PUBLIC IDs). You must not refer to them by the DITA-defined URIs or PUBLIC IDs. Local shells are just that, local. You create them, you own them, you name them. You should consider the DITA-defined shells and attendant module and entity files to be invariant, meaning that you should never ever modify them directly, but only use them by reference, configured using the DITA-defined configuration mechanisms (parameter entities for DTDs, named groups for schemas).
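For example, a document that uses your local configuration of the concept shell would point at your shell under your own public ID, not the OASIS-defined one (the "Acme" owner identifier and file name here are made-up placeholders; use whatever naming convention your organization owns):

    <!DOCTYPE concept PUBLIC "-//Acme//DTD Acme Concept//EN"
      "acme-concept.dtd">

Your local shell then pulls in the unmodified DITA modules by reference and does its configuration through the parameter entities those modules expose.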

All DITA-capable tools should (dare I say "must"?) be capable of using local shells; otherwise they aren't DITA-capable, QED. Probably the biggest potential problem tool is FrameMaker, but then FrameMaker is something of a special case because it's not a true XML tool and its design makes reconfiguration much more expensive than it is with any other XML editor you're likely to use. I'm sure it can be done, but I wouldn't want to have to do it (of course, as a systems integrator I might be asked to, and of course I would do it, but that doesn't mean I'd have to like it).

For example, I've just gone through the exercise of setting up Arbortext Editor 5.3 to support editing of heavily specialized topic types. Once you know what to do it's not too hard and is reasonably well documented in the online help. The basic process is:
  1. Put each shell DTD in its own directory, named the same as the DTD or schema file. This organization is at least suggested, if not required, by Arbortext, but it's pretty good general practice anyway (even though the base DITA distribution doesn't organize things this way, there's no reason it couldn't and I've suggested that maybe it should, just on general principles of neatness).
  2. Create an old-style (non-XML) OASIS entity catalog for mapping the URIs of your local shell DTDs to their local location. (Arbortext Editor 5.3 doesn't support XML-syntax catalogs; see the sketch just after this list.)
  3. For each topic or map type shell, copy the Arbortext-specific configuration and style files from the Arbortext-supplied DITA doctypes that are the closest match to your local shells. Rename as necessary per the Arbortext naming conventions.
  4. Edit the configuration files to reflect the details of your shells. This is stuff like setting the name used in the New file dialog, pointing to templates and samples, and so on. For specializations you'll need to account for new element types in the editor configuration, style sheets, and whatnot, if they require special handling.
  5. Update the Arbortext Editor catalog path to include your catalog so it can resolve the references to the DTDs.
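For reference, the catalog in step 2 is just the old text-format OASIS (TR9401) catalog, one PUBLIC entry per shell; the public IDs and paths here are made-up examples:

    PUBLIC "-//Acme//DTD Acme Concept//EN"
           "acme-concept/acme-concept.dtd"
    PUBLIC "-//Acme//DTD Acme Reference//EN"
           "acme-reference/acme-reference.dtd"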
That's it. I would expect other XML editors to require a similar process (I haven't tried setting up XMetal for these specializations yet so I don't know what its configuration details would be).

Note too that as long as you are putting your shells in your own directory structures, and not in the dtd/ directory of the DITA Open Toolkit (which you should never do), it doesn't matter what you call your shell DTDs. That is, there's no particular reason not to call your local configuration of the concept shell DTD "concept.dtd".

So if you are a new user of DITA (by which I mean somebody setting up the DITA environment for a defined set of users, not an individual author [unless you are a writing team of one, in which case you are performing both roles]) I strongly urge you to create your own local shell DTDs right now if you haven't done so already.


Comment Spam Continued

So after turning on comment moderation, I still got two or three spam comments from Blogger.com members (whose comments are not moderated) and blocked two from non-Blogger members, which means that comment moderation is not very useful. At least the spam is just inane and not particularly offensive.

My initial take was that it must be humans doing the spamming, but googling on "captcha bypass" quickly leads to information indicating that picture-based captcha can be cracked with 80 to 100% accuracy.

So I guess there's not much I can do about the spam.

Hmph.

It does lead to the idle thought that maybe it will be the spammers who first develop true AI in their quest to win the humans vs bots arms race....

Tuesday, April 17, 2007

Moderating Comments

Either spambots have cracked the comment captcha mechanism or humans are being paid to leave comments. In any case, I've turned on comment moderation to try to turn off the comment spam.


Wednesday, April 11, 2007

XML Documents vs XML Data Packages

Both James Clark's recent posts on XML and JSON and some recent attempts I've made to describe what I do professionally with respect to XML led me to realize that there doesn't seem to be an easy way to distinguish XML documents that are intended primarily to produce human-consumed results (e.g., published books, Web pages, online help, whatever) from XML documents that are purely for program-to-program communication (e.g., the use case that things like JSON are trying to address more effectively than XML necessarily does).

Also, there's a part of me that wouldn't mind XML being returned to its more or less strictly document-centered use rather than being the all-purpose data serialization and communication language it's become. Of course, that's not really a productive line of thought.

But it did make me start to think that the stuff that James and others are starting to think about reflects the historical accident that the world, and in particular the Web-based world, needed a more transparent data communication mechanism than things like CORBA and DCOM provided just when XML appeared. People with the requirement saw XML as a way to do what they needed without spending too much time thinking about whether or not it was optimal--it was there and it would work well enough and here we are.

But it leads me to think that I agree with what I think James is saying: that it's probably not a bad thing to start designing serialization languages that are optimized for the specific tasks of program-to-program communication.

The existence of such languages would not in any way threaten the status of XML as a language for (human readable) document representation.

One thing that XML has done is embed a number of key concepts and practices in the general programming world, such as making a clearer distinction between syntax and abstraction. That distinction sets the stage for realizing that once you have the abstraction, the original syntax doesn't matter, which means you can have multiple useful syntaxes for the same abstraction. XML has made the general notion of serialization to and from abstract data structures via a transparent, human-readable syntax a fundamental aspect of data processing and communication infrastructures.

I think this means that we are now at a place where the community at large can see how you could refactor the syntax part of the system without the immediate need to refactor the abstractions (which is where most of the code is bound, that is, code that operates on DOM nodes rather than code that operates on SAX events or directly on XML byte sequences).

But it seems reasonable to me to at least start planning this refactor simply in the name of system optimization. It will probably take 20 years (it took 20 years to go from SGML as published in 1986 to today, when we can clearly understand why XML isn't the best solution for some applications) but it seems doable.

While the infrastructure for XML is widely deployed and ubiquitous, we also have the advantage that that infrastructure is by and large modular (in the sense that it's provided by more or less pluggable libraries in Java, .NET, C++, and so on) and in languages that are themselves ubiquitous.

For example, if Java or .NET released core libraries with base support for something like JSON it would not be hard for application programmers to start refactoring their systems to move from using XML for data packages to using JSON. Of course the systems using heavyweight things like SOAP would have a harder row to hoe.

If we take the Flickr API as an example of an XML-based API where something like JSON might be a better fit (or at least simplify or optimize the serialization/deserialization process), it would take a few person-months on the Flickr end to provide a JSON version of the API (which would have to live alongside the XML version) and a few person-days or person-weeks for each of the language-specific client-side bindings for the Flickr API to use the JSON version instead of the XML version. At some point, say in a couple of years, the XML version of the API could be retired. That seems like a reasonable refactoring cost if the value of using something like JSON is non-trivial (I don't have an opinion on the value of something like JSON in this case--I just don't care enough, and I've never had much patience for "but this is more elegant than that" arguments if that's your *only* argument).

The Flickr API may be a poor example only in that the data structures communicated are fairly simple, mostly just sets of name/value pairs (metadata on photos) or lists of pictures or sets or users or tags. In that use case, XML works as well as anything else.
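To make that concrete, here is a made-up photo record in the two syntaxes; for flat name/value data like this there isn't much to choose between them:

    <photo id="1234" title="Sunset over the lake" owner="drmacro" ispublic="1"/>

    {"photo": {"id": "1234", "title": "Sunset over the lake", "owner": "drmacro", "ispublic": "1"}}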

But in a more complex use case, where the data structures serialized are more complicated, in the way James talks about, with non-trivial data types and complex composite object structures and whatnot, I can definitely see a purpose-built language having real value, primarily in the ease with which programmers doing the serialization/deserialization can both design and understand the mapping from the objects to the serialized form.

I spent some time working with the STEP standard (ISO 10303), a standard for generic representation of complex data structures, originally designed for interchange of CAD drawings and 3-D models and eventually generalized into a language for general product data interchange. It provides a sophisticated data modeling language. I was involved in the subgroup that was trying to define the XML interchange representation of STEP models. This turned out to be a really hard problem precisely because of the mismatch between the sophisticated STEP models and XML's data structures and data types (which at the time amounted to little more than strings). It confirmed what I already knew: mapping abstract data structures to efficient and complete XML representations is hard, and naive approaches based on simple samples will not work.

That means that a comparable interchange syntax that is a better match for complex data structures will have value simply by making the conceptual task easier, so that designing and understanding serialization forms is easy, or at least easier than it is using XML.

And then I can have my XML all to myself for creating "real" documents....

Tuesday, April 10, 2007

Help Me Obi-Wan: Java and Encodings and XML

While I consider myself a pretty good Java programmer I don't actually do that much processing of XML with Java and so I've never fully internalized the details of SAX and JAXP and all that. Pretty much I just crib code that will get me a DOM and hope it works or get someone else to implement the fiddly bits.

But today I ran into a wall and all my fiddly bit colleagues are elsewhere so I thought I would ask my readers for help.

Here's what I'm trying to do:

I have XML documents with Arabic content. I read these documents into an internal data structure, do stuff, and write the result out as different XML. Should be easy.
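For context, the kind of code I'm cribbing is the stock JAXP recipe for getting a DOM, more or less this (a sketch; the class and file names are invented):

    import java.io.File;
    import javax.xml.parsers.DocumentBuilder;
    import javax.xml.parsers.DocumentBuilderFactory;
    import org.w3c.dom.Document;

    public class ParseSketch {
        public static void main(String[] args) throws Exception {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            // Parsing from a File (rather than a Reader) lets the parser sniff
            // the encoding from the XML declaration, which is what should happen.
            Document doc = builder.parse(new File("arabic-topic.xml"));
            System.out.println(doc.getDocumentElement().getNodeName());
        }
    }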

However, I'm finding several odd things that I don't quite understand:

1. Text.getData() is *not* returning a sequence of Unicode characters; it is returning a sequence of characters that correspond one-to-one to the bytes of the UTF-8 encoding of the original Unicode characters.

That threw me because I thought XML data *was* Unicode and therefore Text.getData() should return Unicode characters, not a sequence of single-byte chars. Or have I totally misunderstood how Java manages Strings (I don't think so)?

This is solved by getting the bytes from the string returned by Text.getData() and reinterpreting them using an InputStreamReader with the encoding set to "utf-8". (Is there a better way? Have I again missed something obvious?)
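Spelled out, the workaround looks roughly like this (the class and method names are mine; using ISO-8859-1 on getBytes() is just my way of recovering the raw byte values, one byte per char):

    import java.io.ByteArrayInputStream;
    import java.io.InputStreamReader;
    import java.io.Reader;

    public class ReDecode {
        // Takes a string whose chars are really UTF-8 bytes (one char per byte)
        // and re-decodes it as UTF-8. ISO-8859-1 maps char values 0-255 back to
        // the same single-byte values, which recovers the original byte sequence.
        public static String reDecodeUtf8(String bytesAsChars) throws Exception {
            byte[] bytes = bytesAsChars.getBytes("ISO-8859-1");
            Reader in = new InputStreamReader(new ByteArrayInputStream(bytes), "utf-8");
            StringBuffer result = new StringBuffer();
            for (int c = in.read(); c != -1; c = in.read()) {
                result.append((char) c);
            }
            // new String(bytes, "utf-8") would do the same thing in one line.
            return result.toString();
        }
    }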

2. When I save the same document as UTF-16 the DOM construction process fails with "Content not allowed in prolog", which doesn't compute because it's not conceivable that any non-trivial XML parser wouldn't handle UTF-16 correctly.

3. When re-interpreting the UTF-8 bytes into characters, it mostly works, except that at least one character, \uFE8D (Arabic Letter Alef Isolated Form), whose UTF-8 byte sequence is EF BA 8D, is reported as EF BA EF, which is not a valid UTF-8 sequence and so gets converted to \uFFFD and "?" by the input stream reader.

WTF?

I suspect that I am in fact using a crappy parser but there is so much indirection and layers and IDEs and stuff that it's very difficult, at least for me, to determine which parser I'm using, much less how to control the parser I want to use. I'm developing my code using Eclipse 3.2. I've tried setting my project to both Java 1.4 and 5.0 with no change in behavior.

For this project I have the Xerces 2.9.0 library (as reported by org.apache.xerces.impl.Version) in my classpath.
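In case it helps anyone reproduce this, the quickest way I know to see which parser JAXP is actually handing me is to ask it directly (a sketch; the class name is invented):

    import javax.xml.parsers.DocumentBuilderFactory;
    import javax.xml.parsers.SAXParserFactory;

    public class WhichParser {
        public static void main(String[] args) throws Exception {
            // Print the concrete classes JAXP selects, which shows whether the
            // JDK's bundled parser or the Xerces jar on the classpath is in use.
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            System.out.println("DOM factory: " + dbf.getClass().getName());
            System.out.println("DOM builder: " + dbf.newDocumentBuilder().getClass().getName());
            System.out.println("SAX factory: " + SAXParserFactory.newInstance().getClass().getName());
        }
    }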

Does anyone have any idea what might be going on here?

Any help or pointers on what I might be doing wrong or how to fix it?

Friday, April 06, 2007

James Clark in the House

Norm Walsh reports that James Clark has entered the blogosphere: http://blog.jclark.com/.

Let me add my welcome to Norm's. You can bet that I'm subscribed....

DITA Specialization Tutorial Now on Xiruss.org

I have started writing a more complete DITA specialization tutorial, which will eventually cover all aspects of DITA specialization and likely lead to additional tutorials on other aspects of using DITA (using the DITA Open Toolkit, writing a Toolkit plug-in, etc.).

The tutorial itself is published on my xiruss.org site here: http://www.xiruss.org/tutorials/dita-specialization/, including a package with all the source materials as well as the generated HTML version.

The source materials are managed for development in the XIRUSS Subversion repository here: http://xiruss-t.svn.sourceforge.net/viewvc/xiruss-t/specialization_tutorial/, should you for some reason want to track the development of the files, get the very latest stuff (can't imagine why, but who knows?), or just get a particular file without downloading the whole package.

The tutorial includes an improved version of the DITA attribute domain specialization tutorial I posted here a while back.

It is of course written as a set of DITA topics, which is interesting in and of itself because a tutorial is a type of document for which the DITA concept/task/reference and highly fragmented presentation paradigms are not necessarily a good match. For example, I discovered that the only way to get prev/next links from one topic to the next within a logical narrative sequence of topics is to set the collection-type of their parent container in the organizing map to "sequence". However, this has the effect of numbering each topic in the sequence, which makes sense for the topics that represent a logical sequence of steps within the tutorial, but not for the purely conceptual overview of what DITA specialization is. (This is what the DITA Open Toolkit does today--whether this behavior is required by the DITA spec is a more subtle question.)
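(For the curious, the map markup involved is just the collection-type attribute on the parent topicref; the file names here are invented:)

    <map title="DITA Specialization Tutorial">
      <topicref href="specialization-overview.dita"/>
      <topicref href="attribute-domain-specialization.dita" collection-type="sequence">
        <topicref href="step-1-declare-the-attribute.dita"/>
        <topicref href="step-2-create-the-shell-dtd.dita"/>
      </topicref>
    </map>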

So it raises some issues, like do we need a tutorial-specific set of specializations and corresponding rendering customizations to get the effects I want as a tutorial author, or does the DITA spec need to be refined to reflect these sorts of more subtle rhetorical distinctions? Are my topics that describe a sequence of steps to be performed really task or concept topics (I've coded them as concepts because even in DITA 1.1, the task topic type is too restrictive in the way it represents sequences of steps)?

This makes the activity more fun than it would otherwise be--I always like it when the things I do result both in concrete products (a useful tutorial) and in advancing the state of our understanding and, hopefully, the supporting infrastructure--in this case by serving as an experiment in applying DITA to a type of information for which it was not directly designed. Not that I'm the first to create tutorials in DITA, or even the first to think about it (see the discussion around this on the DITA Users Yahoo group), but as an informal, spare-time activity this tutorial provides more opportunity for introspection about the process and methods and, because it's public, more opportunity for community involvement.

I've also learned a lot about using DITA and hacking the Toolkit and stuff, which makes it fun.

Now if I could just stop waking up at 5:30 a.m. to work on the thing (It's not that I want to wake up at 5:30, it's just that once I am awake and my brain starts spinning I can't go back to sleep, so I am compelled to start working. Good for productivity, bad for physical and mental health.)
