Dr. Macro's XML Rants

NOTE TO TOOL OWNERS: In this blog I will occasionally make statements about products that you will take exception to. My intent is to always be factual and accurate. If I have made a statement that you consider to be incorrect or inaccurate, please bring it to my attention and, once I have verified my error, I will post the appropriate correction.

And before you get too exercised, please read the post, dated 9 Feb 2006, titled "All Tools Suck".

Sunday, August 11, 2013

Monastic SGML: 20 Years On

In 1993 I was working at IBM with Wayne Wohler, Don Day, Simcha Gralla, and others on IBM ID Doc, the SGML replacement for IBM's GML-based Bookmaster application, which was used for all of IBM's product documentation and much of its internal documentation. Wayne worked for IBM Publishing solutions and had been one of the developers of IBM's SGML processing tool set, having taken Charles Goldfarb's original SGML parser implementation and reworked it into something appropriate for an IBM product (Charles was also an IBM employee during the time he developed the SGML standard and HyTime). Wayne had also been involved in various efforts to develop or adapt visual editors for editing GML and SGML. At the time, Wayne and I were also developing the specifications for a general authoring support system that would manage SGML, allow editing, and so on.

IBM had been doing pretty sophisticated content reuse even back in the 80's using what facilities there were in the IBM Document Composition Facility (DCF), which was the underpinning for the Bookmaster application. So we understood the requirements for modular content, sharing of small document components among publications, and so on.

We were also trying to apply the HyTime standard to IBM ID Doc's linking requirements and I was starting to work with Charles Goldfarb and Steven Newcomb on the 2nd edition of the HyTime standard.

Out of that work we started to realize that SGML's focus on syntax, and its many features designed to make the syntax easy to type, made SGML difficult to process in the context of things like visual editors and content management systems, because those features imposed sequential processing requirements on the content.

We started to realize that for the types of applications we were building, a more abstract, node-based way of viewing SGML was required and that certain SGML features got in the way of that.

Remember that this was in the early days of object-oriented programming so the general concept of managing things as trees or graphs of nodes was not as current as it is now. Also, computers were much less capable, so you couldn't just say "load all that stuff into memory and then chew on it" because the memory just wasn't there, at least not on PCs and minicomputers. For comparison, at that time, it took about 8 clock hours on an IBM mainframe to render a 500-page manual to print using the Bookmaster application. That was running overnight when the load on the mainframe was relatively low.

Out of this experience Wayne and I developed the concept of "monastic SGML", which was simply choosing not to use those features of SGML that got in the way of the kind of processing we wanted to do.

We presented these ideas at the SGML '93 conference as a poster. That poster, I'm told, had a profound effect on many thought leaders in the SGML community and helped start the process that led to the development of XML. I was invited by Jon Bosak to join the "SGML on the Web" working group he was forming specifically because of monastic SGML (I left IBM at the end of 1993 and my new employer, Passage Systems, generously allowed me to both continue my SGML and HyTime standards work and join this new SGML on the Web activity, as did my next employer, ISOGEN, when I left Passage Systems in 1996).

For this, the 20th anniversary of the presentation of monastic SGML to the world, Debbie Lapeyre asked if I could put up a poster reflecting on monastic SGML at the Balisage conference. I didn't have any record of the poster with me and Debbie hadn't been able to find one in years past, but I reached out to Wayne and he dug through his archives and found the original SGML source for the poster. I've reproduced that below. I was able to post the original monastic SGML poster. These are my reflections.

The text of the poster is here:

Monastic SGML

Objective

Facilitate reuse of document fragments by enabling more reliable validation of document fragments without knowing all contexts in which they are used. Secondary objective: Remove sequential processing biases from datastream wherever possible.

Assumptions

Document fragments contain a single element and its content representing a proper subtree of a document and this element is valid in every point at which the fragment is referenced.

Rules

  • Don't use inclusions except on the root element, don't use exclusions
    Inclusions and exclusions can have the effect of invalidating the content of an element in one context while it remains valid in another.
  • Do not define short reference maps in the DTD
    Short references can change the recognition of delimiters based on context which can make a fragment invalid in one context while not in another.
    Other reasons to avoid them:
    • If USEMAP declarations occur in an instance, they are inherently sequential.
    • Short references can be used to obscure the true meaning of the markup in a given context.
  • Don't use #CURRENT attributes in the DTD
    #CURRENT attribute's use of values from prior specifications can make the first occurrence of a fragment invalid.
    Other reasons to avoid them:
    • This construct is inherently sequential.
  • Avoid the use of IGNORE/INCLUDE marked sections
    These marked section types make it impossible to validate the information without
    • knowing all valid combinations of conditions for all using document
    • modifying all using documents to set these conditions

If you compare XML to these rules, you can see that we certainly applied them to XML, and a lot more.

Inclusions and exclusions (collectively called "exceptions" in SGML) were a powerful, if somewhat dangerous, feature of SGML DTDs, in which you could define a content model and then additionally either allow element types that would be valid in any context descending from the element being declared (inclusions) or disallow element types from any descendant context (exclusions). Interestingly, RelaxNG has almost this feature because you can modify base patterns to either allow additional things or disallow specific things, the difference being that the inclusion or exclusion only applies to the specific context, not to all descendant contexts, which was the really evil part of inclusions and exclusions. Essentially, inclusions and exclusions were a syntactic convenience that let you avoid more heavily-parameterized content models or otherwise having to craft your content models for each element type.
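
To make the context dependence concrete, here is a minimal Python sketch. It is a toy model, not a real SGML validator: the element names, the content models, and the `allowed_children` and `fragment_is_valid` helpers are all hypothetical. The only point it illustrates is that inclusions and exclusions declared on an ancestor flow down to every descendant, so the same fragment can be valid under one ancestor and invalid under another.

```python
# Toy model of SGML inclusion/exclusion exceptions; not a real validator.
# All element names and content models here are made up for illustration.

# Base content models: which child element types each element allows.
CONTENT_MODEL = {
    "book": {"chapter"},
    "chapter": {"p", "footnote"},
    "footnote": {"p"},
    "p": {"emph"},
    "emph": set(),
    "indexterm": set(),
}

# Exceptions declared on an element apply to ALL of its descendants.
INCLUSIONS = {"book": {"indexterm"}}      # like +(indexterm) on book
EXCLUSIONS = {"footnote": {"indexterm"}}  # like -(indexterm) on footnote


def allowed_children(ancestors, element):
    """Return the child types allowed in `element` given its ancestor chain."""
    allowed = set(CONTENT_MODEL[element])
    open_elements = list(ancestors) + [element]
    for name in open_elements:
        allowed |= INCLUSIONS.get(name, set())
    # Exclusions take precedence over inclusions.
    for name in open_elements:
        allowed -= EXCLUSIONS.get(name, set())
    return allowed


def fragment_is_valid(ancestors, element, children):
    """Check one level of a fragment: `element` with the given child types."""
    return set(children) <= allowed_children(ancestors, element)


# The same fragment, a <p> containing an <indexterm>, in three contexts:
in_book = fragment_is_valid(["book", "chapter"], "p", ["indexterm"])
in_footnote = fragment_is_valid(["book", "chapter", "footnote"], "p", ["indexterm"])
standalone = fragment_is_valid([], "p", ["indexterm"])
```

In this sketch the fragment is accepted under `book`, rejected inside `footnote`, and rejected when checked on its own, which is the situation the first rule of the poster was meant to avoid.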

In DITA, you see this reflected in the DTD implementation pattern for element types where every element type's content model is fully parameterized in a way that allows for global extension (domain integration) and relatively easy override (constraint modules that simply redeclare the base content-model-defining parameter entity). DocBook and JATS (NLM) have similar patterns.

Short references allowed you to effectively define custom syntaxes that would be parsed as SGML. It was a clever feature intended to support the sorts of things we do today with Wiki markup and ASCII equation markup and so on. In many cases it allowed existing text-based syntaxes to be parsed as SGML. It was driven by the requirement to enable rapid authoring of SGML content in text editors, such as for data conversion. That requirement made sense in 1986 and even in 1996, but is much less interesting now, both because ways of authoring have improved and because there are more general tools for doing parsing and transformation that don't need to be baked into the parser for one particular data format. At the time, SGML was really the only thing out there with any sort of a general-purpose parser.

One particularly pernicious feature of shortref was that you could turn it on and off within a document instance, as we allude to in our rules above. This meant that you had to know what the current shortref set was in order to parse a given part of the document. That works fine for sequential parsing of entire documents, but fails in the case of parsing document fragments outside of any larger document context.

The #CURRENT default option for SGML attributes allowed you to say that the last specified value for the attribute should be used as the default value. This feature was problematic for a number of reasons, but it definitely imposed a sequential processing requirement on the content. This is a feature we dropped from XML without a second thought, as far as I can remember. The semantics of attribute value inheritance or propagation are challenging at best, because they are always dependent on the specific business rules of the vocabulary. During the development of HyTime 2 we tried to work out some general mechanism for expressing the rules for attribute value propagation and gave up. In DITA you see the challenge reflected in the rules for metadata cascade within maps and from maps to topics, which are both complex and somewhat fuzzy. We're trying to clarify them in DITA 1.3 but it's hard to do. There are many edge cases.
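
The sequential dependency is easy to see in a small simulation. The sketch below is plain Python, not an SGML parser; the `resolve_current` function and the `security` attribute are hypothetical names chosen for illustration. It walks occurrences of an element type in document order and fills in a missing value from the most recently specified one, which is the behavior #CURRENT asks for.

```python
# Toy simulation of #CURRENT attribute defaulting; not a real SGML parser.
# `resolve_current` and the "security" attribute are illustrative names only.

def resolve_current(elements, attr):
    """Resolve `attr` for each element occurrence in document order.

    `elements` is a list of dicts of specified attributes, one per
    occurrence of the same element type. Raises ValueError if an
    occurrence omits the attribute before any value has been specified,
    since there is no prior value to fall back on.
    """
    resolved = []
    current = None  # most recently specified value, if any
    for index, specified in enumerate(elements):
        if attr in specified:
            current = specified[attr]
        elif current is None:
            raise ValueError(f"occurrence {index}: no prior value for {attr!r}")
        resolved.append(current)
    return resolved


# In the full document, the second occurrence inherits from the first.
whole_document = [{"security": "internal"}, {}, {"security": "public"}, {}]
values = resolve_current(whole_document, "security")

# Taken out as a fragment, the second occurrence has nothing to inherit from.
fragment = whole_document[1:2]
try:
    resolve_current(fragment, "security")
    fragment_ok = True
except ValueError:
    fragment_ok = False
```

The resolved value of any occurrence depends on everything that came before it, so a fragment cannot be interpreted, or even validated, without replaying the document that precedes it.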

XML still has include and ignore marked sections, but only in DTD declarations. In SGML they could go in document instances, providing a weak form of conditional processing. But for obvious reasons, that didn't work well in an authoring or management context. Modern SGML and XML applications all use element-based profiling, of course. Certainly once SGML editors like Author/Editor (now XMetaL) and Arbortext ADEPT (now Arbortext Editor) were in common use, the use of conditional marked sections in SGML content largely went away.

Looking at these rules now, I'm struck by the fact that we didn't say anything about DTDs in general (that is, the requirement for them) nor anything about the use of parsed entities, which we now know are evil. We didn't say anything about markup minimization, which was a large part of what got left out of XML. We clearly still had the mindset that DTDs were either a given or a hard requirement. We no longer have that mindset.

SGML did have the notion of "subdoc" but it wasn't fully baked and it never really got used (largely because it wasn't useful, although well intentioned). You see the requirement reflected today in things like DITA maps and conref, XInclude, and similar element-based, link-based use-by-reference features. The insight that I had (and why I think XInclude is misguided) is that use-by-reference is an application-level concern, not a source-level concern, which means it's something that is done by the application, as it is in DITA, for example, and not something that should be done by the parser, as XInclude is. Because it is processed by the parser, XInclude ends up being no better than external parsed entities.

If we look at XML, it retains one markup minimization feature from SGML, default attributes. These require DTDs or XSDs or (now) RelaxNGs that use the separate DTD compatibility annotations. Except for #CURRENT, which is obviously a very bad idea, we didn't say anything about attribute defaults. I think this reflects the fact that default attributes are simply such a useful feature that they must be retained. Certainly DITA depends on them and many other vocabularies do as well, especially those developed for complex documentation.

But I can also say from personal experience that defaulted attributes still cause problems for content management. If a document does not have all the attributes in the instance and, as in DITA, certain attributes are required to support specific processing (e.g., specialization-aware processing), then processing the document without a schema that provides those attributes will fail, sometimes apparently randomly and for non-obvious reasons (at least to those not familiar with the specific attribute-based processing requirements of the document).
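
A minimal illustration of this failure mode, using Python's standard `xml.etree.ElementTree` and assuming that its underlying expat parser applies attribute defaults declared in the internal DTD subset. The `class` value imitates DITA's convention, but the document and the `is_topic` check are made-up examples, not real DITA or a real DITA processor.

```python
# Sketch: the same instance parsed with and without its attribute declarations.
# The markup is a made-up, DITA-like example, not a real DITA document.
import xml.etree.ElementTree as ET

INSTANCE = '<topic id="t1"><title>Example</title></topic>'

# Internal DTD subset that declares a defaulted "class" attribute.
WITH_DECLARATIONS = (
    '<!DOCTYPE topic [\n'
    '  <!ATTLIST topic class CDATA "- topic/topic ">\n'
    ']>\n' + INSTANCE
)

with_defaults = ET.fromstring(WITH_DECLARATIONS)
without_defaults = ET.fromstring(INSTANCE)


def is_topic(element):
    """Specialization-style check that relies on the defaulted attribute."""
    return "topic/topic" in element.get("class", "")


recognized_with = is_topic(with_defaults)
recognized_without = is_topic(without_defaults)
```

The instance markup is identical in both cases; only the presence of the declarations changes whether the check succeeds, which is why such failures look random to anyone who does not know the processing depends on defaulted attributes.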

I later somewhat disavowed monastic SGML because I felt it put an unnecessary focus on syntax over abstraction. As I further developed my understanding of abstractions of data as distinct from their syntactic representations, I realized that the syntax to a large degree doesn't matter, and that our concerns were somewhat unwarranted because once you parse the SGML initially, you have a normalized abstract representation that largely transcends the syntax. If you can then store and manage the content in terms of the abstraction, the original syntax doesn't matter too much.

Of course, it's not quite this simple if, for example, you need to remember things like original entity references or CDATA marked sections or other syntactic details so that you can recreate them exactly. So I think my disavowal may itself have been somewhat misguided. Syntax still matters, but it's not everything. At this year's Balisage there were several interesting papers focusing on the syntax/semantics distinction and, for example, defining general approaches for treating any syntax as XML and what that means or doesn't mean.

I for one do not miss any of the features of SGML that we left out of XML and am happy, for example, to have the option of not using DTDs when I don't need or want them or want to use some other document constraint language, like XSD or RelaxNG. Wayne and I were certainly on to something and I'm proud that we made a noticeable contribution to the development of XML.

For the historical record, here is the original SGML source for the poster as recovered from Wayne's personal archive:
<h1>Monastic SGML
<h5>Objective
<p>Facilitate reuse of document fragments by enabling more reliable
validation of document fragments without knowing all contexts in which
they are used.
Secondary objective&colon; Remove
sequential processing biases from datastream whereever possible.
<h5>Assumptions
<p>Document fragments contain a single element and its content
representing a proper subtree of a document and
this element is valid in every point at which the fragment is referenced.
<h2>Rules
<ul>
<li>Don't use inclusions except on the root element, don't use exclusions
<p>Inclusions and exclusions can have the effect of invalidating the
content of an element in one context while it remains valid in another.
<li>Do not define short reference maps in the DTD
<p>Short references can change the recognition of delimiters based on
context which can make a fragment invalid in one context while not in
another.
<p>Other reasons to avoid them:
<ul compact>
<li>If USEMAP declarations occur in an instance, they are inherently
sequential.
<li>Short references can be used
to obscure the true meaning of the markup in a given
context.
</ul>
<li>Don't use #CURRENT attributes in the DTD
<p>#CURRENT attribute's use of values from prior specifications
can make the first occurance of a fragment invalid.
<p>Other reasons to avoid them:
<ul compact>
<li>This construct is inherently sequential.
</ul>
<li>Avoid the use of IGNORE/INCLUDE marked sections
<p>These marked section types make it impossible to validate the
information without
<ul compact>
<li>knowing all valid combinations of conditions for all using
document
<li>modifying all using documents to set these conditions
</ul>
</ul>


2 Comments:

Blogger Steve Calderwood said...

Wow! This was incredibly interesting and enlightening. It is so cool to see an important piece of XML's history recovered, explained, and reviewed. Thanks for sharing it, Eliot.

-Steven

1:26 PM  
Blogger Unknown said...

Interesting history. Long live the
tag :)

5:56 PM  
