Dr. Macro's XML Rants

NOTE TO TOOL OWNERS: In this blog I will occasionally make statements about products that you will take exception to. My intent is always to be factual and accurate. If I have made a statement that you consider to be incorrect or inaccurate, please bring it to my attention and, once I have verified my error, I will post the appropriate correction.

And before you get too exercised, please read the post, dated 9 Feb 2006, titled "All Tools Suck".

Wednesday, April 11, 2007

XML Documents vs XML Data Packages

Both James Clark's recent posts on XML and JSON and some recent attempts I've made to describe what I do professionally with respect to XML led me to realize that there doesn't seem to be an easy way to distinguish XML documents intended primarily to produce human-consumed results (e.g., published books, Web pages, online help, whatever) from XML documents that exist purely for program-to-program communication (e.g., the use case that things like JSON are trying to address more effectively than XML necessarily does).
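To make the distinction concrete, here is a minimal sketch (in Java, with invented element and field names, so treat them as illustrative only) contrasting the two kinds of payload: a document fragment whose mixed content and ordering matter to a human reader, and a data package that is just a record for another program to unpack.

    public class PayloadKinds {
        public static void main(String[] args) {
            // Document-centric XML: mixed content; the ordering and the
            // interleaved prose carry meaning for a human reader.
            String doc =
                "<para>Install the <tool>frobnicator</tool> <em>before</em> "
                + "running the build.</para>";

            // Data package: a record for another program. XML is one
            // possible syntax for it, JSON another.
            String xmlPkg =
                "<photo id=\"42\"><title>Sunset</title><width>800</width></photo>";
            String jsonPkg = "{\"id\": 42, \"title\": \"Sunset\", \"width\": 800}";

            System.out.println(doc);
            System.out.println(xmlPkg);
            System.out.println(jsonPkg);
        }
    }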

Also, there's a part of me that wouldn't mind XML being returned to its more or less strictly document-centered use rather than being the all-purpose data serialization and communication language it's become. Of course that's not really a productive line of thought.

But it did make me start to think that the stuff that James and others are starting to think about reflects the historical accident that the world, and in particular the Web-based world, needed a more transparent data communication mechanism than things like CORBA and DCOM provided just when XML appeared. People with the requirement saw XML as a way to do what they needed without spending too much time thinking about whether or not it was optimal--it was there and it would work well enough and here we are.

But it leads me to think that I agree with what I think James is saying: that it's probably not a bad thing to start designing serialization languages that are optimized for the specific tasks of program-to-program communication.

The existence of such languages would not in any way threaten the status of XML as a language for (human readable) document representation.

One thing XML has done is embed a number of key concepts and practices in the general programming world, such as making a clearer distinction between syntax and abstraction. That distinction sets the stage for realizing that once you have the abstraction, the original syntax doesn't matter, which means you can have multiple useful syntaxes for the same abstraction. XML has also made the general notion of serialization to and from abstract data structures via a transparent, human-readable syntax a fundamental aspect of data processing and communication infrastructures.
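As a sketch of "multiple useful syntaxes for the same abstraction" (all names here are my own invention, and escaping is omitted for brevity):

    public class TwoSyntaxes {
        // The abstraction: nothing about it commits us to any syntax.
        static class Photo {
            final int id;
            final String title;
            Photo(int id, String title) { this.id = id; this.title = title; }
        }

        // Two interchangeable serializations of the same abstraction.
        static String toXml(Photo p) {
            return "<photo id=\"" + p.id + "\"><title>" + p.title + "</title></photo>";
        }

        static String toJson(Photo p) {
            return "{\"id\": " + p.id + ", \"title\": \"" + p.title + "\"}";
        }

        public static void main(String[] args) {
            Photo p = new Photo(42, "Sunset");
            System.out.println(toXml(p));   // <photo id="42"><title>Sunset</title></photo>
            System.out.println(toJson(p));  // {"id": 42, "title": "Sunset"}
        }
    }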

I think this means that we are now at a place where the community at large can see how you could refactor the syntax part of the system without any immediate need to refactor the abstractions, which is where most of the code is bound (that is, code that operates on DOM nodes rather than code that operates on SAX events or directly on XML byte sequences).
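A sketch of what that seam looks like in code (hypothetical names, with the actual parsing faked so the example stands alone): the application is bound to the abstraction, so only one class knows the wire syntax, and swapping it is a one-line change.

    public class SyntaxSeam {
        static class Photo {
            final int id; final String title;
            Photo(int id, String title) { this.id = id; this.title = title; }
        }

        // Application code is written against this abstraction...
        interface PhotoSource { Photo fetch(int id); }

        // ...so only the implementations know the wire syntax.
        static class XmlPhotoSource implements PhotoSource {
            public Photo fetch(int id) {
                // in real code: HTTP GET plus a DOM parse of the XML response
                return new Photo(id, "parsed-from-xml");
            }
        }

        static class JsonPhotoSource implements PhotoSource {
            public Photo fetch(int id) {
                // in real code: HTTP GET plus a JSON parse of the response
                return new Photo(id, "parsed-from-json");
            }
        }

        public static void main(String[] args) {
            PhotoSource source = new JsonPhotoSource(); // the only line that changes
            System.out.println(source.fetch(42).title);
        }
    }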

But it seems reasonable to me to at least start planning this refactor simply in the name of system optimization. It will probably take 20 years (it took 20 years to go from SGML as published in 1986 to today, when we can clearly understand why XML isn't the best solution for some applications) but it seems doable.

While the infrastructure for XML is widely deployed and ubiquitous, we also have the advantage that that infrastructure is by and large modular (in the sense that it's provided by more or less pluggable libraries in Java, .NET, C++, and so on) and in languages that are themselves ubiquitous.

For example, if Java or .NET released core libraries with base support for something like JSON it would not be hard for application programmers to start refactoring their systems to move from using XML for data packages to using JSON. Of course the systems using heavyweight things like SOAP would have a harder row to hoe.
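Today neither platform ships JSON support in its core libraries, but the small third-party org.json library gives a feel for how modest the client-side change would be; a minimal round trip (field names invented):

    import org.json.JSONObject;  // third-party org.json library, not part of the JDK

    public class JsonRoundTrip {
        public static void main(String[] args) throws Exception {
            // Serialize: build the data package programmatically.
            JSONObject out = new JSONObject();
            out.put("id", 42);
            out.put("title", "Sunset");
            System.out.println(out.toString());  // {"id":42,"title":"Sunset"}

            // Deserialize: one constructor call instead of a DOM walk.
            JSONObject in = new JSONObject("{\"id\": 42, \"title\": \"Sunset\"}");
            System.out.println(in.getInt("id") + " / " + in.getString("title"));
        }
    }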

If we take the Flickr API as an example of an XML-based API where something like JSON might be a better fit (or at least simplify or optimize the serialization/deserialization process), it would take a few person-months on the Flickr end to provide a JSON version of the API (which would have to live alongside the XML version) and a few person-days or weeks for each of the language-specific client-side bindings for the Flickr API to use the JSON version instead of the XML version. At some point, say in a couple of years, the XML version of the API could be retired. That seems like a reasonable refactoring cost if the value of using something like JSON is non-trivial (I don't have an opinion on the value of something like JSON in this case--I just don't care enough, and I've never had much patience for "but this is more elegant than that" arguments if that's your *only* argument).

The Flickr API may be a poor example only in that the data structures communicated are fairly simple, mostly just sets of name/value pairs (metadata on photos) or lists of pictures or sets or users or tags. In that use case, XML works as well as anything else.
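For concreteness, here is roughly what that looks like on the wire. The XML is patterned on Flickr's REST responses from memory, and the JSON rendering is my own invention, so both are illustrative rather than exact:

    public class FlickrShapes {
        public static void main(String[] args) {
            // Patterned on a Flickr REST photo-list response (attribute
            // names abbreviated and from memory).
            String xml =
                "<rsp stat=\"ok\"><photos page=\"1\" total=\"2\">"
                + "<photo id=\"2636\" title=\"Sunset\" ispublic=\"1\"/>"
                + "<photo id=\"2637\" title=\"Harbor\" ispublic=\"0\"/>"
                + "</photos></rsp>";

            // One possible JSON rendering of the same flat structure.
            String json =
                "{\"stat\": \"ok\", \"photos\": {\"page\": 1, \"total\": 2, \"photo\": ["
                + "{\"id\": 2636, \"title\": \"Sunset\", \"ispublic\": 1},"
                + "{\"id\": 2637, \"title\": \"Harbor\", \"ispublic\": 0}]}}";

            System.out.println(xml);
            System.out.println(json);
        }
    }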

But in a more complex use case, where the serialized data structures are more complicated in the way James talks about, with non-trivial data types and complex composite object structures and whatnot, I can definitely see a purpose-built language having real value, primarily in the ease with which the programmers doing the serialization/deserialization can both design and understand the mapping from the objects to the serialized form.
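One concrete source of that ease, sketched with invented names: JSON's native types (numbers, booleans, arrays, maps) line up directly with common in-memory structures, while XML offers a designer several equally plausible encodings to choose among, all of them stringly typed.

    public class MappingChoices {
        public static void main(String[] args) {
            // In XML, even a trivial typed structure forces design decisions:
            // attributes or child elements? a wrapper element for lists?
            // And every value is text until a schema layer says otherwise.
            String xmlA = "<point x=\"1.5\" y=\"2.0\"/>";
            String xmlB = "<point><x>1.5</x><y>2.0</y></point>";
            String xmlC = "<point><coord>1.5</coord><coord>2.0</coord></point>";

            // In JSON there is one obvious shape, and the numbers stay numbers.
            String json = "{\"x\": 1.5, \"y\": 2.0}";

            System.out.println(xmlA + "\n" + xmlB + "\n" + xmlC + "\n" + json);
        }
    }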

I spent some time working with the STEP standard (ISO 10303), a standard for the generic representation of complex data structures, originally designed for the interchange of CAD drawings and 3-D models and eventually generalized into a language for product data interchange broadly. It provides a sophisticated data modeling language. I was involved in the subgroup that was trying to define the XML interchange representation of STEP models. This turned out to be a really hard problem, precisely because of the mismatch between XML's data structures and data types (essentially just strings at the time) and the sophisticated STEP models. It confirmed what I already knew: mapping abstract data structures to efficient and complete XML representations is hard, and naive approaches based on simple samples will not work.
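One flavor of the difficulty, using my own toy example rather than anything from the STEP work: models like STEP's are graphs with shared, typed references, while an XML document is a tree, so a faithful mapping has to layer an identity convention (ID/IDREF or similar) on top, and a naive containment-based mapping silently turns sharing into duplication.

    public class GraphVsTree {
        public static void main(String[] args) {
            // Naive containment mapping: the shared endpoint is duplicated,
            // and the fact that both lines end at the SAME point is lost.
            String naive =
                "<line><end x=\"0\" y=\"0\"/></line>"
                + "<line><end x=\"0\" y=\"0\"/></line>";

            // Faithful mapping: identity has to be invented on top of the
            // tree, e.g. with ID/IDREF-style links.
            String withIds =
                "<point id=\"p1\" x=\"0\" y=\"0\"/>"
                + "<line end=\"p1\"/>"
                + "<line end=\"p1\"/>";

            System.out.println(naive);
            System.out.println(withIds);
        }
    }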

That means a comparable interchange syntax that is a better match for complex data structures will have value simply by making the conceptual task easier, so that designing and understanding serialization forms is easy, or at least easier than it is with XML.

And then I can have my XML all to myself for creating "real" documents....

3 Comments:

Blogger Alexey Zakhlestin said...

This comment has been removed by the author.

9:00 AM  
Blogger Alexey Zakhlestin said...

Another option for a "communication format" is YAML. I believe it is better than JSON, as it allows representing strictly typed data.

9:00 AM  
Blogger Patrick Mueller said...

I've been tootin' my horn on this as well ... my blog

12:13 PM  
