Dr. Macro's XML Rants

NOTE TO TOOL OWNERS: In this blog I will occasionally make statements about products that you will take exception to. My intent is to always be factual and accurate. If I have made a statement that you consider to be incorrect or inaccurate, please bring it to my attention and, once I have verified my error, I will post the appropriate correction.

And before you get too exercised, please read the post, dated 9 Feb 2006, titled "All Tools Suck".

Friday, February 24, 2006

Rants Preview

I'm in the process of moving house and I've been sick and, blah blah blah, haven't really had the time and energy to post to this blog, but I thought I would at least preview some of the rants kicking around in my head that I do plan to write about. These are things that I've been thinking about for a long time and/or have already ranted about in one place or another but haven't necessarily written down in as cogent a fashion as I should have. These are in no particular order:

  • External parsed entities are bad and you should never use them

    Short form: Entities are not reusable objects and only lead to pain. Besides, you should be using schemas (no entity mechanism) anyway. It was our mistake to allow them to remain in XML and I for one am as guilty of that mistake as anyone. I now regret it.
  • I was wrong (sort of) about namespaces

    Short form: I said at the time that namespaces were wrong because they didn't define a standard way to bind a namespace to an (abstract) application. In practice it didn't matter and as it happens, schemas (and similar document constraint mechanisms) provide a standard way to do this binding (sort of).
  • XML lacks a standard way to identify abstract applications as distinct from their associated namespaces and schemas

    Short form: An abstract application (for example DITA, your corporate technical documentation system, a cross-industry data interchange scheme, etc.) may involve any number of document types and namespaces. There is no defined way to give a name to the application, as an abstraction, and then formally map that name to the set of namespaces and their associated schemas, application semantics documentation, and other related artifacts. In practice it's not clear that this level of formality is needed (which is probably why we don't have it) but it still seems like, for completeness, such a mechanism should exist. However, I think that a lot of people's assumptions (including mine) about how people would deploy and use widely-used, ubiquitous applications were wrong. If the Web has taught us anything it's that sometimes a solution that seems a little too simpleminded is just right. Go figure.
  • All XML content management systems (with very few, if any, exceptions) are wrong

    Short form: any system that manages XML storage at the element level is fundamentally flawed, unnecessarily complicated, doomed to performance and maintenance problems, and just plain misguided. This does not apply to systems that are only intended to enable retrieval, such as MarkLogic. Note that indexing at the element level (which you have to do) is different from storage at the element level.
  • Exactly one XSD schema doc per namespace

    Short form: things get funky when there are multiple XSD documents for a given namespace in the same processing/storage/management environment absent a well-understood mechanism for distinguishing them.
  • All newly-defined element types (that is, element types defined from this time forward) should be in a namespace other than the no-namespace namespace. Legacy applications should be reworked to use namespaces as soon as practical.

    Short form: Namespaces enable unambiguous binding of documents to constraints in a non-author-subvertible way. Namespaces enable clear and unambiguous integration of different vocabularies into a single composite document type.
  • PUBLIC identifiers are bogus and pointless and there is no reason to ever use them, even with DOCTYPE declarations.

    Short form: First, you shouldn't use external parsed entities or DOCTYPE declarations anyway (in which case PUBLIC IDs aren't even an option). Beyond that, in XML, PUBLIC identifiers are redundant with URIs for the same resource and therefore just add another name to an already crowded world of names to be managed. OASIS catalogs can remap URIs as easily as PUBLIC IDs so there's no indirection advantage there (and there never was--they were bogus in SGML too but I don't think any of us really appreciated it at the time).
  • XInclude is good as far as it goes but it gets ID handling during transclusion wrong

    Short form: It is unnecessarily and inappropriately constraining to require XML IDs to be unique among all the members of a compound document. XInclude processors must be capable of rewriting IDs and references to them so that the transcluded result retains the appropriate uniqueness constraints. I understand why XInclude imposes this requirement but in practice it is not useful, especially in the context of document authoring systems. (I wrote a paper about this for one of the recent XML Europe conferences.)
  • XInclude is not, as specified, appropriate for document authoring.

    Short form: Practical authoring requires that XInclude elements be specialized so that references can have context constraints imposed and so that they can express constraints on what can and cannot be referenced. (Also in my XML Europe paper).
  • Xerces, while otherwise an exemplary tool and a key part of any Java-based XML processing system, is fundamentally flawed in how it enables/does URL resolution via OASIS catalogs

    Short form: OASIS catalogs clearly enable and expect URLs to be recursively mappable via a single catalog. That is, an XML processor doing catalog-aware resource resolution should continue to try to use the available catalogs to resolve a URL until all catalog entries are exhausted--only then do you attempt to resolve the result URL to a resource. Out of the box, Xerces does not do this and provides, as far as I can tell, no API to enable it. In particular, Xerces conflates or confuses entity resolution with resource resolution. I tried to find a code fix but the code was too convoluted and the problem seemed to be a fundamental design flaw that would require significant change to the Xerces API. I submitted a bug but was essentially told "not a bug". But it's a bug. Ask Norm.
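
To make the catalog point concrete, here is a minimal sketch in Python of the "keep resolving until the catalog is exhausted" behavior described in the last item. It is only an illustration: the catalog is modeled as a plain dictionary of URI-to-URI mappings (real OASIS catalogs have several entry types and this ignores all of them), and the function name, the `max_hops` limit, and the example URIs are all hypothetical. It has nothing to do with the Xerces API.

```python
def resolve_through_catalog(uri, catalog, max_hops=32):
    """Remap `uri` through `catalog` repeatedly until no entry matches.

    `catalog` is a simplified stand-in for an OASIS catalog: a dict that
    maps one URI string to another. Only the final, unmapped URI should
    be handed to whatever actually fetches the resource.
    """
    seen = {uri}
    for _ in range(max_hops):
        mapped = catalog.get(uri)
        if mapped is None:
            # No catalog entry applies any more: this is the URI to fetch.
            return uri
        if mapped in seen:
            # Guard against entries that map back onto each other.
            raise ValueError(f"catalog cycle detected at {mapped!r}")
        seen.add(mapped)
        uri = mapped
    raise ValueError("too many catalog hops")


# Hypothetical two-step mapping: a name -> a web URL -> a local file.
example_catalog = {
    "urn:example:doctype": "http://example.org/dtd/doc.dtd",
    "http://example.org/dtd/doc.dtd": "file:///opt/schemas/doc.dtd",
}
print(resolve_through_catalog("urn:example:doctype", example_catalog))
```

A resolver that stops after the first lookup would hand back the `http://example.org/...` URL here and go to the network, which is the single-pass behavior the rant objects to.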


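The ID rewriting called for in the XInclude item can also be sketched briefly. The Python below is a rough illustration, not how any XInclude processor actually works: it prefixes every `xml:id` in a fragment before the fragment is merged into a host document, and fixes up references that point inside the same fragment. The function name, the prefix scheme, and the use of a `linkend` attribute as the reference attribute are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# ElementTree's expanded name for the xml:id attribute.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def rewrite_ids(fragment, prefix, ref_attr="linkend"):
    """Prefix every xml:id in `fragment` and update same-fragment references.

    `ref_attr` is an assumed name for the attribute that carries ID
    references; a real vocabulary may use several such attributes.
    """
    renamed = {}
    for el in fragment.iter():
        old = el.get(XML_ID)
        if old is not None:
            renamed[old] = f"{prefix}{old}"
            el.set(XML_ID, renamed[old])
    for el in fragment.iter():
        target = el.get(ref_attr)
        if target in renamed:
            el.set(ref_attr, renamed[target])
    return fragment


# Host and included fragment both use the ID "intro".
host = ET.fromstring('<doc><p xml:id="intro"/></doc>')
frag = ET.fromstring('<section xml:id="intro"><xref linkend="intro"/></section>')
host.append(rewrite_ids(frag, "inc1-"))
print(ET.tostring(host, encoding="unicode"))
```

After the rewrite the merged document holds `intro` and `inc1-intro`, so the two modules can be authored independently without coordinating their IDs.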
That's all I can think of this morning. I'm sure there are more.

2 Comments:

Anonymous said...

So if I shouldn't use entity references and I shouldn't use XInclude for big documentation projects, then what options do I have for reuse?

1:05 PM  
Eliot Kimber said...

I'm not saying you shouldn't use XInclude--far from it. What I am saying is that XInclude--as specified--is not really appropriate for authoring because it offers no defined specialization mechanism.

But XInclude-style use-by-reference is what I recommend. See my paper on the subject, "Modular Information: Using XInclude to Support Re-Use for Authoring and Production", here: http://www.idealliance.org/papers/dx_xmle04/papers/03-05-01/03-05-01.html

9:36 AM  
