
NOTE TO TOOL OWNERS: In this blog I will occasionally make statements about products that you will take exception to. My intent is always to be factual and accurate. If I have made a statement that you consider to be incorrect or inaccurate, please bring it to my attention and, once I have verified my error, I will post the appropriate correction.

And before you get too exercised, please read the post, dated 9 Feb 2006, titled "All Tools Suck".

Tuesday, January 05, 2016

Some DITA and DocBook History: Common Origins, Very Different Results

The following was originally posted to the DITA Users' Yahoo Group on 4 Jan 2016 in the context of a discussion of DITA vs. DocBook. My intent with this bit of history is to show how both DocBook and DITA (through its ancestor, IBM ID Doc) started development around the same time, more or less from a single meeting. Those of us at IBM took things in one direction; those in the Unix-focused community went in a different direction.

The original post (edited for typos):

If you look at the history of DocBook and DITA, both descend from the same time period, the late 80’s, when the technical communication industry in particular (but not exclusively) was trying to figure out how to apply this new SGML technology to its information management and document production challenges.

In the case of DocBook the genesis was primarily standardizing Unix man pages. In the case of DITA it was IBM’s attempt to standardize the markup used across the many different divisions and product groups within IBM as well as satisfy the requirements of online delivery of hyperlinked documents, something IBM was doing in the 80’s, long before anyone else outside of hypertext research groups, as far as I know.

There was a meeting in the late 80’s, I think 1989, where representatives from the major software and hardware vendors, including IBM, HP, Digital Equipment, Groupe Bull, and one or more Unix vendors (the names escape me now; all except IBM and HP are long gone), met to discuss ways of standardizing the markup across their documentation in order to have some hope of interchange among them.

The meeting was hosted by Fred Dalrymple of the Open Software Foundation at offices in the Boston area. The work was led by Eve Maler, who was pioneering approaches to DTD design and modularization (she popularized the “pizza” model, adopted by the TEI and also reflected somewhat in DocBook and DITA). I was there with Wayne Wohler representing IBM. (Eve wrote the first book on SGML DTD design, “Developing SGML DTDs: From Text to Model to Markup”, with Jeanne El Andaloussi, who was at Groupe Bull at the time.)

One of the key things that Eve did was make a table that related the markup vocabularies of each participant to each other vocabulary. There was a row for “paragraph”, a row for “H1”, etc. [I’m sure I don’t have a copy of this table anywhere but it would be interesting to see it now—I have a clear picture of it in my mind but not clear enough to reproduce. But this table was, in many ways, the direct inspiration for my approach to markup design and set the direction of my technical career from then to now.]

What this table made clear was that all these languages had the same basic set of semantic elements but they all used different tag names and had different detailed rules for the content. But they all had some kind of paragraph element, headings, tables, lists, etc. (Remember that this was before HTML had been defined by Sir Tim Berners-Lee, who based HTML on the basic tag set of IBM’s GML Starter Set language, which predated SGML and was in use at CERN at the time.)

What Wayne and I got from this meeting was that A) there was this semantic correspondence and B) we needed a way to allow differences in markup details (tag names, content models) that still allowed interoperation. I realized that one could define a layered architecture with these base types as its foundation and, given a way to map specific element types to their bases, allow variety in markup naming and content details while preserving interchange and common processing.
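
To make the idea concrete, here is a sketch in the syntax DITA eventually adopted for exactly this purpose, the @class attribute (anachronistic here, and the element name, id, and content are purely illustrative):

  <concept id="widget-overview" class="- topic/topic concept/concept ">
    <title>Widget Overview</title>
  </concept>

A processor that knows nothing about <concept> can read the @class value, see that the element is a kind of topic/topic, and fall back to generic topic processing; a concept-aware processor can do something more specific.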

Soon after this, Wayne and I, along with Don Day, Simcha Gralla, and others, started working on IBM’s SGML replacement for the GML-based BookMaster language, which was used for most of IBM’s documentation and had more than 600 element types, reflecting a very broad range of requirements. BookMaster allowed for very efficient creation of documentation delivered in print and online on 5 different computer platforms using IBM’s BookManager tool, which provided electronic books starting in the mid 80’s. But BookMaster was also big and difficult to change or extend. It suffered the same problem that all large, all-encompassing vocabularies suffer: it became a tar pit, difficult to adapt to new requirements. IBM had a committee that considered BookMaster change requests and it worked on a 6-month cycle at best. BookMaster was also based on proprietary IBM composition technology, the Document Composition Facility, which was becoming obsolete with the development of PCs and more modern processing languages and systems.

At this same time Dr. Charles Goldfarb, the inventor of GML and SGML, was working on HyTime, an SGML-based language for hypertext representation. Dr. Goldfarb knew that he couldn’t impose a specific tag set but had to have a way to allow any element type to indicate what kind of HyTime thing it was. His solution was “architectural forms”, a mechanism that relied on specific SGML features to allow elements to declare how they related to the HyTime-defined element types and attributes. It also imposed basically the same content model constraints that DITA specialization imposes, namely that the content models of derived element types had to be consistent with those of their architectural bases, although HyTime was necessarily less restrictive.
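
For example, HyTime defined a “contextual link” form called clink; a document designer could make any element type a clink by declaring the mapping in the DTD, along the lines of this sketch (simplified; the real mechanism had more moving parts):

  <!ELEMENT xref - - (#PCDATA)>
  <!ATTLIST xref
    HyTime  NAME  #FIXED "clink"  -- architecturally, this element is a clink --
    linkend IDREF #REQUIRED       -- the link end defined by the clink form --
  >

The tag name xref is purely local; a HyTime engine cares only that the element declares itself a clink and supplies a linkend.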

For the SGML BookMaster replacement, which we called IBM ID Document Type (IDDoc), we needed robust linking and we needed something like architectural forms. So we adopted HyTime both for linking and for the architectural forms mechanism. [As a side effect I became involved with Dr. Goldfarb and Dr. Newcomb with the development of the HyTime standard itself. You can ask my wife about “No, Charles.” sometime…]

For IBM ID Doc we defined a base set of elements that reflected the 25-or-so basic semantic elements that Eve had identified at that meeting at the OSF. The rest of the vocabulary was then built up from those base types. This layered architecture allowed the implementation of common processing while allowing local creation of new vocabulary to meet new requirements. Interchange and interoperation were preserved but the overall system became more flexible. This design was completed in about 1993 and implementation and use proceeded and continues to this day, although I understand that use of IDDoc is almost completely replaced by use of DITA within IBM. I left IBM in 1994. Don Day stayed.

Thus DITA reflects one ancestral branch from those early days of SGML application design.

Soon after or at the same time as the OSF meeting, another group of people founded the Davenport group, focused on standardizing Unix man pages. I was not directly involved in these meetings so I can’t comment on the details, but their work became the basis for DocBook. I did attend one DocBook meeting sometime in the early 90’s (I remember I was still wearing suits per the IBM dress code, so it had to be before ’92 or ’93) and presented my attempt to use architectural forms to formally map DocBook to IDDoc, trying to plant the idea of architectural forms and layered architectures, but I was not successful. I think I was seen mostly as a disruptive crank, which I probably was to some degree.

[From Fred Dalrymple’s LinkedIn page, on his time at OSF: “Designed the book style and created formatting tools for all OSF technical publications, published by Prentice-Hall. Led migration of OSF technical publications from legacy format (UNIX nroff/troff) to SGML, including definition of the OSF DTD and development of transformation tools. This work led directly to the creation of DocBook and the Topic Maps standard, ISO/IEC 13250:2000.”]

Don and Michael Priestley can give the history of the development of DITA within IBM after I left at the end of ’93 but the result is apparent today: the DITA we know and love.

In the ensuing decade between ’93 and 2003 I became an editor of HyTime 2nd Edition and a founding member of the XML Working Group. I did a lot of client work developing custom SGML and XML vocabularies and tried to apply the same layered architectural model that we had defined at IBM. XML omitted the SGML features required for HyTime’s architectural forms mechanism (which is why DITA has the @class attribute it does), and the publication of the XML standard in 1998 made HyTime instantly obsolete (we published HyTime 2nd Edition in 1996, just in time for it to be completely ignored by most people, although its influence is still felt in newer applications, including DITA, XLink, TEI, JATS, and DocBook).

When Don approached me in 2000 or 2001 about this DITA standard thing he was starting, I was very eager to participate because I saw it as a potential way to fully realize many of the ideas I’d been working with over the previous decade or so.

[This is the end of the original posting. Obviously there is lots more history here but I think this provides some insight into how DITA and DocBook came to be. Would definitely like to hear the DocBook side of this story as I'm sure I've either omitted important events or misrepresented important aspects.]


Sunday, August 11, 2013

Monastic SGML: 20 Years On

In 1993 I was working at IBM with Wayne Wohler, Don Day, Simcha Gralla, and others on IBM ID Doc, the SGML replacement for IBM's GML-based BookMaster application, which was used for all of IBM's product documentation and much of its internal documentation. Wayne worked for IBM Publishing Solutions and had been one of the developers of IBM's SGML processing tool set, having taken Charles Goldfarb's original SGML parser implementation and reworked it into something appropriate for an IBM product (Charles was also an IBM employee during the time he developed the SGML standard and HyTime). Wayne had also been involved in various efforts to develop or adapt visual editors for editing GML and SGML. At the time, Wayne and I were also developing the specifications for a general authoring support system that would manage SGML, allow editing, and so on.

IBM had been doing pretty sophisticated content reuse even back in the 80's using what facilities there were in the IBM Document Composition Facility (DCF), which was the underpinning for the BookMaster application. So we understood the requirements for modular content, sharing of small document components among publications, and so on.

We were also trying to apply the HyTime standard to IBM ID Doc's linking requirements and I was starting to work with Charles Goldfarb and Steven Newcomb on the 2nd edition of the HyTime standard.

Out of that work we started to realize that SGML, with its focus on syntax and its many features designed to make the syntax easy to type, was difficult to process in the context of things like visual editors and content management systems, because those features imposed sequential processing requirements on the content.

We started to realize that for the types of applications we were building, a more abstract, node-based way of viewing SGML was required and that certain SGML features got in the way of that.

Remember that this was in the early days of object-oriented programming so the general concept of managing things as trees or graphs of nodes was not as current as it is now. Also, computers were much less capable, so you couldn't just say "load all that stuff into memory and then chew on it" because the memory just wasn't there, at least not on PCs and minicomputers. For comparison, at that time, it took about 8 clock hours on an IBM mainframe to render a 500-page manual to print using the BookMaster application. That was running overnight when the load on the mainframe was relatively low.

Out of this experience Wayne and I developed the concept of "monastic SGML", which was simply choosing not to use those features of SGML that got in the way of the kind of processing we wanted to do.

We presented these ideas at the SGML '93 conference as a poster. That poster, I'm told, had a profound effect on many thought leaders in the SGML community and helped start the process that led to the development of XML. I was invited by Jon Bosak to join the "SGML on the Web" working group he was forming specifically because of monastic SGML (I left IBM at the end of 1993 and my new employer, Passage Systems, generously allowed me to both continue my SGML and HyTime standards work and join this new SGML on the Web activity, as did my next employer, ISOGEN, when I left Passage Systems in 1996).

For this, the 20th anniversary of the presentation of monastic SGML to the world, Debbie Lapeyre asked if I could put up a poster reflecting on monastic SGML at the Balisage conference. I didn't have any record of the poster, and Debbie hadn't been able to find one in years past, but I reached out to Wayne and he dug through his archives and found the original SGML source for the poster, which I've reproduced below. These are my reflections.

The text of the poster is here:

Monastic SGML

Objective

Facilitate reuse of document fragments by enabling more reliable validation of document fragments without knowing all contexts in which they are used. Secondary objective: Remove sequential processing biases from datastream wherever possible.

Assumptions

Document fragments contain a single element and its content representing a proper subtree of a document and this element is valid in every point at which the fragment is referenced.

Rules

  • Don't use inclusions except on the root element, don't use exclusions
    Inclusions and exclusions can have the effect of invalidating the content of an element in one context while it remains valid in another.
  • Do not define short reference maps in the DTD
    Short references can change the recognition of delimiters based on context which can make a fragment invalid in one context while not in another.
    Other reasons to avoid them:
    • If USEMAP declarations occur in an instance, they are inherently sequential.
    • Short references can be used to obscure the true meaning of the markup in a given context.
  • Don't use #CURRENT attributes in the DTD
    A #CURRENT attribute's use of values from prior specifications can make the first occurrence of a fragment invalid.
    Other reasons to avoid them:
    • This construct is inherently sequential.
  • Avoid the use of IGNORE/INCLUDE marked sections
    These marked section types make it impossible to validate the information without
    • knowing all valid combinations of conditions for all using documents
    • modifying all using documents to set these conditions

If you compare XML to these rules, you can see that we certainly applied them to XML, and a lot more.

Inclusions and exclusions were a powerful, if somewhat dangerous, feature of SGML DTDs, in which you could define a content model and then additionally either allow element types that would be valid in any context descending from the element being declared (inclusions) or disallow element types from any descendant context (exclusions). Interestingly, RelaxNG has almost this feature, because you can modify base patterns to either allow additional things or disallow specific things, the difference being that the addition or removal only applies to the specific context, not to all descendant contexts, which was the really evil part of inclusions and exclusions. Essentially, inclusions and exclusions were a syntactic convenience that let you avoid more heavily-parameterized content models or otherwise having to craft your content models for each element type.
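
In DTD syntax the feature looked like this (a sketch, not from any real DTD):

  <!-- Inclusion: footnotes and index markers may appear anywhere
       inside a chapter, at any depth -->
  <!ELEMENT chapter - - (title, section+) +(fn | indexterm)>
  <!-- Exclusion: but no footnote may appear anywhere inside a
       footnote, again at any depth -->
  <!ELEMENT fn - - (p+) -(fn)>

The "at any depth" part is what made fragment validation unreliable: a paragraph that is valid on its own may become invalid when referenced from inside a footnote.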

In DITA, you see this reflected in the DTD implementation pattern for element types where every element type's content model is fully parameterized in a way that allows for global extension (domain integration) and relatively easy override (constraint modules that simply redeclare the base content-model-defining parameter entity). DocBook and JATS (NLM) have similar patterns.
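
A stripped-down sketch of the pattern (the real DITA modules are considerably more elaborate, but this is the shape):

  <!-- Constraint module, included first: because the first declaration
       of a parameter entity wins in a DTD, this redeclaration overrides
       the base content model -->
  <!ENTITY % p.content "(#PCDATA)">

  <!-- Base module: the content model lives in a parameter entity -->
  <!ENTITY % basic.ph "ph | b | i">
  <!ENTITY % p.content "(#PCDATA | %basic.ph;)*">
  <!ELEMENT p %p.content;>

Domain integration works the same way from the other direction: a domain module redeclares an entity like %basic.ph; to add its own element names before the base module is read.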

Short references allowed you to effectively define custom syntaxes that would be parsed as SGML. It was a clever feature intended to support the sorts of things we do today with Wiki markup and ASCII equation markup and so on. In many cases it allowed existing text-based syntaxes to be parsed as SGML. It was driven by the requirement to enable rapid authoring of SGML content in text editors, such as for data conversion. That requirement made sense in 1986 and even in 1996, but is much less interesting now, both because ways of authoring have improved and because there are more general tools for doing parsing and transformation that don't need to be baked into the parser for one particular data format. At the time, SGML was really the only thing out there with any sort of a general-purpose parser.

One particularly pernicious feature of shortref was that you could turn it on and off within a document instance, as we allude to in our rules above. This meant that you had to know what the current shortref set was in order to parse a given part of the document. That works fine for sequential parsing of entire documents, but fails in the case of parsing document fragments out of any large document context.
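
A typical use, sketched from memory (the delimiter details varied), was to make a blank line start a new paragraph:

  <!ENTITY ptag STARTTAG "p">          <!-- expands to the start-tag <p> -->
  <!SHORTREF pmap "&#RS;&#RE;" ptag>   <!-- a blank line triggers ptag -->
  <!USEMAP pmap body>                  <!-- but only within body -->

The USEMAP is the trap: move a fragment from inside body to some other context and the same blank lines no longer mean <p>.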

The #CURRENT default option for SGML attributes allowed you to say that the last specified value for the attribute should be used as the default value. This feature was problematic for a number of reasons, but it definitely imposed a sequential processing requirement on the content. This is a feature we dropped from XML without a second thought, as far as I can remember. The semantics of attribute value inheritance or propagation are challenging at best, because they are always dependent on the specific business rules of the vocabulary. During the development of HyTime 2 we tried to work out some general mechanism for expressing the rules for attribute value propagation and gave up. In DITA you see the challenge reflected in the rules for metadata cascade within maps and from maps to topics, which are both complex and somewhat fuzzy. We're trying to clarify them in DITA 1.3 but it's hard to do. There are many edge cases.
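
A sketch of what we were worried about (the element and attribute names are invented):

  <!ELEMENT row  - O (cell+)>
  <!ELEMENT cell - O (#PCDATA)>
  <!ATTLIST cell align (left | center | right) #CURRENT>

  <!-- In an instance: -->
  <row><cell align="center">one</cell><cell>two</cell></row>

The second cell silently inherits align="center" from the first, and the very first cell in the document must specify a value or the document is invalid. Pull a fragment out of its original position and both behaviors change.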

XML still has INCLUDE and IGNORE marked sections, but only in the external DTD subset. In SGML they could go in document instances, providing a weak form of conditional processing. But for obvious reasons, that didn't work well in an authoring or management context. Modern SGML and XML applications all use element-based profiling, of course. Certainly once SGML editors like Author/Editor (now XMetal) and Arbortext ADEPT (now Arbortext Editor) were in common use, the use of conditional marked sections in SGML content largely went away.
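
In an SGML instance, conditional text looked like this sketch (the switch entity would be declared in the DTD or the document's internal subset):

  <!ENTITY % draft "INCLUDE">   <!-- flip to "IGNORE" for the final build -->

  <![ %draft; [
  <p>Reviewer note: this section still needs a technical review.</p>
  ]]>

To validate such a document you have to know every combination of INCLUDE and IGNORE settings any using document might apply, which is exactly the objection in the poster.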

Looking at these rules now, I'm struck by the fact that we didn't say anything about DTDs in general (that is, the requirement for them) nor anything about the use of parsed entities, which we now know are evil. We didn't say anything about markup minimization, which was a large part of what got left out of XML. We clearly still had the mindset that DTDs were either a given or a hard requirement. We no longer have that mindset.

SGML did have the notion of "subdoc" but it wasn't fully baked and it never really got used (largely because it wasn't useful, although well intentioned). You see the requirement reflected today in things like DITA maps and conref, XInclude, and similar element-based, link-based use-by-reference features. The insight that I had (and why I think XInclude is misguided) is that use-by-reference is an application-level concern, not a source-level concern, which means it's something that is done by the application, as it is in DITA, for example, and not something that should be done by the parser, as XInclude is. Because it is processed by the parser, XInclude ends up being no better than external parsed entities.
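
The markup makes the contrast visible (a sketch; the URIs and IDs are hypothetical):

  <!-- XInclude: resolved by the parser, so the application never sees
       the reference, only the merged result -->
  <section xmlns:xi="http://www.w3.org/2001/XInclude">
    <xi:include href="safety-warning.xml"/>
  </section>

  <!-- DITA conref: an ordinary attribute on an ordinary element,
       resolved by DITA-aware processing on the application's terms -->
  <p conref="shared.dita#shared-warnings/safety-warning"/>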

If we look at XML, it retains one markup minimization feature from SGML, default attributes. These require DTDs or XSDs or (now) RelaxNGs that use the separate DTD compatibility annotations. Except for #CURRENT, which is obviously a very bad idea, we didn't say anything about attribute defaults. I think this reflects the fact that default attributes are simply such a useful feature that they must be retained. Certainly DITA depends on them and many other vocabularies do as well, especially those developed for complex documentation.
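
You can see the dependence in any DITA DTD, where the @class values that drive specialization-aware processing are supplied as attribute defaults (simplified here; the real declarations include other attribute groups):

  <!ATTLIST p class CDATA "- topic/p ">

An author never types class="- topic/p " in the instance; the parser supplies it from the DTD, and parsing the same document without the DTD makes the attribute, and everything that depends on it, silently disappear.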

But I can also say from personal experience that defaulted attributes still cause problems for content management. If a document does not carry all the attributes in the instance, and, as in DITA, certain attributes are required to support specific processing (e.g., specialization-aware processing), then processing the document outside the context of a schema that provides those attributes will fail, sometimes apparently randomly and for non-obvious reasons (at least to those not familiar with the document's attribute-based processing requirements).

I later somewhat disavowed monastic SGML because I felt it put an unnecessary focus on syntax over abstraction. As I further developed my understanding of abstractions of data as distinct from their syntactic representations, I realized that the syntax to a large degree doesn't matter, and that our concerns were somewhat unwarranted because once you parse the SGML initially, you have a normalized abstract representation that largely transcends the syntax. If you can then store and manage the content in terms of the abstraction, the original syntax doesn't matter too much.

Of course, it's not quite this simple if, for example, you need to remember things like original entity references or CDATA marked sections or other syntactic details so that you can recreate them exactly. So my disavowal may itself have been somewhat misguided. Syntax still matters, but it's not everything. At this year's Balisage there were several interesting papers focusing on the syntax/semantics distinction and, for example, defining general approaches for treating any syntax as XML and what that means or doesn't mean.

I for one do not miss any of the features of SGML that we left out of XML and am happy, for example, to have the option of not using DTDs when I don't need or want them, or of using some other document constraint language, like XSD or RelaxNG. Wayne and I were certainly on to something, and I'm proud that we made a noticeable contribution to the development of XML.

For the historical record, here is the original SGML source for the poster as recovered from Wayne's personal archive:
<h1>Monastic SGML
<h5>Objective
<p>Facilitate reuse of document fragments by enabling more reliable
validation of document fragments without knowing all contexts in which
they are used.
Secondary objective&colon; Remove
sequential processing biases from datastream whereever possible.
<h5>Assumptions
<p>Document fragments contain a single element and its content
representing a proper subtree of a document and
this element is valid in every point at which the fragment is referenced.
<h2>Rules
<ul>
<li>Don't use inclusions except on the root element, don't use exclusions
<p>Inclusions and exclusions can have the effect of invalidating the
content of an element in one context while it remains valid in another.
<li>Do not define short reference maps in the DTD
<p>Short references can change the recognition of delimiters based on
context which can make a fragment invalid in one context while not in
another.
<p>Other reasons to avoid them:
<ul compact>
<li>If USEMAP declarations occur in an instance, they are inherently
sequential.
<li>Short references can be used
to obscure the true meaning of the markup in a given
context.
</ul>
<li>Don't use #CURRENT attributes in the DTD
<p>#CURRENT attribute's use of values from prior specifications
can make the first occurance of a fragment invalid.
<p>Other reasons to avoid them:
<ul compact>
<li>This construct is inherently sequential.
</ul>
<li>Avoid the use of IGNORE/INCLUDE marked sections
<p>These marked section types make it impossible to validate the
information without
<ul compact>
<li>knowing all valid combinations of conditions for all using
document
<li>modifying all using documents to set these conditions
</ul>
</ul>
