
NOTE TO TOOL OWNERS: In this blog I will occasionally make statements about products that you will take exception to. My intent is to always be factual and accurate. If I have made a statement that you consider to be incorrect or inaccurate, please bring it to my attention and, once I have verified my error, I will post the appropriate correction.

And before you get too exercised, please read the post, dated 9 Feb 2006, titled "All Tools Suck".

Sunday, January 26, 2014

DITA without a CMS: Tools for Small Teams

[This is a copy of a post I made to the Yahoo DITA Users list.]

A topic of discussion that comes up quite a bit (it came up at the recent Central Texas DITA User Group meeting) is how to "do DITA" without a CMS, by which we usually mean, how to implement an authoring and production workflow for a small team of authors with limited budget without going mad?

NOTE: I'm using the term "CMS" to mean what are often called Component Content Management (CCM) systems.

This is something I've been thinking about and doing for going on 30 years now, first at IBM and then as a consultant. At IBM we had nothing more than mainframes and line-oriented text editors along with batch composition systems yet we were able to author and manage libraries of books with sophisticated hyperlinks within and across books and across libraries. How did we do it? Mostly through some relatively simple conventions for creating IDs and references to them and a bit of discipline on the part of writing teams. We were using pre-SGML structured markup back then but the principles still apply today.

As I say in my book, DITA for Practitioners, some of my most successful client projects have not had a CMS component.

Note that I'm saying this as somebody who has worked for and still works closely with a major CMS vendor (RSI Content Solutions). In addition, as a DITA consultant who wants to work with everybody and anybody, I take some risk saying things like this since a large part of the DITA tools market revolves around CMS systems (as opposed to editors or composition systems, where the market has essentially settled on a small set of mature providers that are unlikely to change anytime soon).

So let me make it clear that I'm not suggesting that you never need a CMS--in an enterprise context you almost certainly do, and even in smaller teams or companies, lighter-weight systems like EasyDITA, DITAToo, Componize, BlueStream XDocs, and DocZone can offer significant value within the limits of tight small-team budgets.

But for many teams a CMS will always be prohibitive, whether in cost or time or both, especially at the start of projects. So there is a significant part of the DITA user community for whom CMS systems are either an unaffordable luxury or something to be worked toward and justified once an initial DITA project proves itself.

In addition, an important aspect of DITA is that you can get started with DITA and be productive very quickly without having to first put a CMS in place. Even if you know you need to have a CMS, you can start small and work up to it. I have seen many documentation projects fail because too much emphasis was put on implementing the CMS first.

Many people get the idea, for whatever reason, that a CMS is the cost of entry for implementing DITA, and that is simply not the case for many DITA users.

The net of my current thinking is that this tool set:
  • git for source content management
  • DITA Open Toolkit for output processing
  • Jenkins for centralized and distributed process automation
  • oXygenXML for editing and local production
allows you to implement an almost-complete, low-cost DITA authoring, management, and production system. Of these four tools, only one, oXygenXML, is commercial. If you use Github to host private repositories, that has a cost, but it's minimal.

In particular, the combination of git, Jenkins, and the Open Toolkit enables easy implementation of centralized, automatic build processing of DITA content. Platform-as-a-service (PaaS) providers like CloudBees, OpenShift, and Amazon Web Services provide free and low-cost options for quickly setting up central servers for things like Jenkins, web sites, and so on, with varying degrees of privacy and easy-to-scale options.

The key here is low dollar cost and low labor investment to get something up and running quickly.

This doesn't include the effort needed to customize and extend the OT to meet your specific output needs--that's a separate cost dependent entirely on your specific requirements. But the community continues to improve its support for doing OT customization and the tools are continually improving, so that should get easier as time goes on (for example, Leigh White's DITA for Print book from XML Press makes doing PDF customization much easier than it was before--it's personally saved me many hours in my recent PDF customization projects).

For each of these tools there are of course suitable alternatives. I've specified these specific tools because of their ubiquity, specific features, and ease of use. But the same approach and principles could be applied to other combinations of comparable tools.

OK, so on to the question of when must you have a CMS and when can you get by without one?

A key question is: what services do CMS systems provide, how critical are they, and what alternatives are available?

As in all such endeavors it's a question of understanding your requirements and matching those requirements to appropriate solutions. For small tech doc teams the immediate requirements tend to be:
  1. Centralized management of content with version control and appropriate access control
  2. Production of appropriate deliverables
  3. Increased reuse to reduce content redundancy
  4. Localization
Given that understanding of very basic small-team requirements, how do the available tools align to those requirements?

Since the question is CMS vs. ad-hoc systems built from the components described above, the main question is "What do CMS systems do and how do the ad-hoc tools compare?"

CMS systems should or do provide the following services:

1. Centralized storage of content. For many groups just getting all their content into a single storage repository is a big step forward, moving things off of people's personal machines or out of departmental storage silos.

2. Version management. The management of content objects as versions in time.

3. Access control. Providing controls over who can do what with different objects under what conditions.

4. Metadata management for content objects. The ability to add custom metadata to objects in order to enable finding or support specific business processes. This includes things like classification metadata, ownership or rights metadata, and metadata specific to internal business processes or data processing.

5. Search and retrieval of content objects. The ability to search for and reliably find content objects based on their content, metadata, workflow status, etc.

6. Management of media assets. The ability to manage non-XML assets (images, videos, etc.) used by the main content objects. This typically includes support for media object metadata, format conversion, support for variants, streaming, and so on. Usually includes features to manage the large physical data storage required. Sometimes provided by dedicated Digital Asset Management (DAM) systems.

7. Link management. Includes maintaining "where used" information about content and media assets, management of addressing details, and so on.

8. Deliverable production. Managing the generation of deliverables from content stored in the CMS, e.g. running the Open Toolkit or equivalent processes.

These are all valuable features and as the volume of your content increases, as the scope of collaboration increases, and as the complexity of your re-use and linking increases, you will certainly need systems that provide these features. Implementing these services completely and well is a hard task and commercial systems are well worth the cost once you justify the need. You do not want to try to build your own system once you get to that point.

In any discussion like this you have to balance the cost of doing it yourself with the cost of buying a system. While it's easy to get started with free or low-cost tools, you can find yourself getting to a place where the time and labor cost of implementing and maintaining a do-it-yourself system is greater than the cost of licensing and integrating a commercial system. Scope creep is a looming danger in any effort like this. Applying agile methods and attitudes is highly recommended.

The nice thing about DITA is that, if you don't do anything too tool-specific, you should be able to transition from a DIY system to a commercial one with minimum abuse to your content. That's part of the point of XML in general and DITA in particular.

Also, keep in mind that DITA has been explicitly architected from day 1 to not require any sort of CMS system--everything you need to do with DITA can be done with files on the file system and normal tools. There is no "magic" in the DITA design (although it may feel like it to tool implementors sometimes).

So how far can you get without a dedicated CMS system?

I suggest you can get quite a long way.

Services 1, 2, and 3: Basic data management

The first three services--centralized storage, version management, and access control--are provided by all modern source code management (SCM) tools, e.g. Subversion, git, etc. With the advent of free and low-cost services like Github, there is essentially zero barrier to using modern SCM systems for managing all your content. Git and Github in particular make it about as easy as it could be. You can use Github for free for public repositories and at a pretty low cost for private repositories, or you can easily implement internal centralized git repositories within an enterprise if you have a server machine available. There are lots of good user interfaces for git, including Github's own client as well as free tools like SourceTree.

Git in particular has several advantages:
  • It is optimized to minimize redundant storage and network bandwidth. That makes it suitable for managing binaries as well as XML content. Essentially you can just put everything in git and not worry about it.
  • It uses a distributed repository model, in which each user can have a full copy of the central repository to which they can commit changes locally before pushing them to the central repository. This means you can work offline and still do incremental commits of content. Because git is efficient with storage and bandwidth, it's practical to have everything you need locally, minimizing dependency on live connections to a central server.
  • Its branching model makes working with complex revision workflows about as easy as it can be (which is not very easy, but it's an inherently challenging situation).

Service 4: Metadata Management

Here DITA provides its own solution in that DITA comes out of the box with a robust and fully extensible metadata model, namely the <data> element. You can put any metadata you need in your maps and topics, either by using <data> with the @name attribute or by creating simple specializations that add new metadata element types tailored to your needs. For media assets you can either create key definitions with metadata that point to media objects or use something like the DITA for Publishers <art> and <art-ph> elements to bind <data> elements to references to media objects (unfortunately, the <image> element does not allow <data> as a direct child through DITA 1.3).
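
For example, here is a minimal sketch of what topic-level metadata can look like using <data> in the topic prolog; the @name values (product, review-status, owner) are invented examples, not any standard vocabulary:

  <topic id="widget-install">
    <title>Installing the widget</title>
    <prolog>
      <metadata>
        <!-- Hypothetical metadata names; use whatever your business processes need -->
        <data name="product" value="WidgetPro"/>
        <data name="review-status" value="draft"/>
        <data name="owner" value="techpubs"/>
      </metadata>
    </prolog>
    <body>
      <p>...</p>
    </body>
  </topic>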

In addition, you can use subject schemes and classification maps to impose metadata onto anything if you need to.
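
For instance, a subject scheme map along these lines binds a controlled set of values to the @platform attribute (the subject keys shown are just examples):

  <subjectScheme>
    <!-- The controlled values -->
    <subjectdef keys="os">
      <subjectdef keys="linux"/>
      <subjectdef keys="windows"/>
      <subjectdef keys="macos"/>
    </subjectdef>
    <!-- Bind the values to the @platform attribute -->
    <enumerationdef>
      <attributedef name="platform"/>
      <subjectdef keyref="os"/>
    </enumerationdef>
  </subjectScheme>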

It is in the area of metadata management that CMS systems really start to demonstrate their value. If you need sophisticated metadata management then a CMS is probably indicated. But for many small teams, metadata management is not a critical requirement, at least not at first.

Service 5: Search and Retrieval

This is another area where CMS systems provide obvious value. If your content is in an SCM, the SCM itself probably doesn't provide any particular search features.

But you can use existing search facilities, including those built into modern operating systems and those provided by your authoring tools (e.g., Oxygen's search across files and search within maps features). Even a simple grep across files can get you a long way.

If you have more implementation resources you can also look at using open-source full-text systems or XML databases like eXist and MarkLogic to do searching. It takes a bit more effort to set up but it might still be cheaper than a dedicated CMS, at least in the short term.

If your body of content is large or you spend a lot of time trying to find things or simply determining if you do or don't have something, a commercial CMS system is likely to be of value. But if your content is well understood by your authors, created in a disciplined way, and organized in a way that makes sense, then you may be able to get by without dedicated search support for quite a long time.

In addition, you can do things with maps to provide catalogs of components and so on. Neatness always counts and this is an area where a little thought and planning can go a long way.
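
For example, a simple "catalog" map that does nothing more than collect your reusable warning topics gives authors one obvious place to look (the file names here are invented):

  <map>
    <title>Reusable warnings and notes</title>
    <!-- One topicref per reusable component; authors browse this map rather than searching -->
    <topicref href="common/warning-hot-surface.dita"/>
    <topicref href="common/warning-electrical.dita"/>
    <topicref href="common/note-recycling.dita"/>
  </map>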

Service 6: Management of Media Objects

This depends a lot on your specific media management requirements, but SCM systems like git and Subversion can manage binaries just fine. You can use continuous integration systems like Jenkins and open-source tools like ImageMagick to automate format conversion, metadata extraction, and so on.
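
As a rough sketch of the kind of automation I mean, an Ant target like the following (which a Jenkins job could run) uses ImageMagick's convert command to turn source PNGs into web JPEGs; the directory names are assumptions and ImageMagick must be on the path:

  <target name="convert-images">
    <mkdir dir="out/images"/>
    <!-- Runs "convert source.png target.jpg" once per source file -->
    <apply executable="convert" dest="out/images">
      <srcfile/>
      <targetfile/>
      <fileset dir="source/images" includes="**/*.png"/>
      <mapper type="glob" from="*.png" to="*.jpg"/>
    </apply>
  </target>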

If you have huge volumes of media assets, requirements like rights management, complex development workflows, and so on, then a CMS with DAM features is probably indicated.

But if you're just managing images that support your manuals, you can probably get by with some well-thought-out naming and organizational conventions and use of keys to reference your media objects. 
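
For example, a key definition like this one (key and file names invented) gives an image a stable, meaningful name, carries its metadata, and lets topics reference it with a simple <image keyref="fig-widget-overview"/> so the files can be reorganized later without touching any topics:

  <keydef keys="fig-widget-overview" href="images/widget/widget-overview.png" format="png">
    <topicmeta>
      <!-- Asset metadata travels with the key definition -->
      <data name="source-file" value="illustrations/widget-overview.ai"/>
      <data name="alt-text" value="Block diagram of the widget subsystems"/>
    </topicmeta>
  </keydef>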

Service 7: Link Management

This is the service where CMS systems really shine, because without a dedicated, central, DITA-aware repository that can maintain real-time knowledge of all links within your DITA content, it's difficult to answer the "where-used" question quickly. It's always possible to implement brute-force processing to do it, but SCM systems like Subversion or git are not going to do anything for you here out of the box. It's possible to implement commit-time processing to capture and update link information (which you could automate with Jenkins, for example), but that's not something a typical small team is going to implement on their own.

On the other hand, by using clear and consistent file, key, and ID naming conventions and using keys you can make manual link management easier--that's essentially what we did at IBM all those years ago when all we had were stone knives and bear skins. The same principles still apply today.

An interesting exercise would be to use a Jenkins job to maintain a simple where-used database that's updated on commit to the documentation SCM repository. It wouldn't be that hard to do.
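
As a very rough sketch of the extraction half of such a job, an XSLT 2.0 transform like this could be run over each committed DITA file to emit "this file points at that target" records for loading into whatever simple database you choose (it's illustrative only and ignores key resolution, scoped keys, and so on):

  <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">
    <xsl:output method="text"/>
    <xsl:template match="/">
      <!-- One tab-separated line per outbound reference: source document, attribute, target -->
      <xsl:for-each select="//*[@href or @conref or @keyref]">
        <xsl:for-each select="@href | @conref | @keyref">
          <xsl:value-of select="document-uri(root(.))"/>
          <xsl:text>&#9;</xsl:text>
          <xsl:value-of select="name(.)"/>
          <xsl:text>&#9;</xsl:text>
          <xsl:value-of select="."/>
          <xsl:text>&#10;</xsl:text>
        </xsl:for-each>
      </xsl:for-each>
    </xsl:template>
  </xsl:stylesheet>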

Service 8: Deliverable Production

CMS systems can help with process automation and the value of that is significant. However, my recent experience with setting up Jenkins to automate build processes using the Open Toolkit makes it clear that it's now pretty easy to set up DITA process automation with available tools. It takes no special knowledge beyond knowing how to set up the job itself, which is not hard, just non-obvious.

Jenkins is a "continuous integration" (CI) server that provides general facilities for running arbitrary processes triggered by things like commits to source code repositories. Jenkins is optimized for Java-based projects and has built-in support for running Ant, connecting to Subversion and git repositories, and so on. This means you can have a Jenkins job triggered when you commit objects to your Subversion or git repository, run the Open Toolkit or any other command you can script, and either simply archive the result within the Jenkins server or transfer the result somewhere else. You can implement various success/failure and quality checks and have it notify you by email or other means when something breaks. Jenkins provides a nice dashboard for getting build status, investigating builds, and so on. Jenkins is an open-source tool that is widely used within the Java development community. It's easy to install and available in all the cloud service environments. If your company develops software it's likely you already use Jenkins or an equivalent CI system that you could use to implement build automation.
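
To make that concrete, a Jenkins "Invoke Ant" build step can run a small wrapper script along these lines, which just delegates to the Open Toolkit's own build file (the directory names, map name, and transformation type are assumptions about your setup, and the Toolkit's usual classpath and environment setup still applies):

  <project name="doc-build" default="html">
    <!-- Assumes a configured DITA Open Toolkit is checked into the repository under tools/DITA-OT -->
    <property name="dita.dir" location="tools/DITA-OT"/>
    <target name="html">
      <ant antfile="${dita.dir}/build.xml" dir="${dita.dir}">
        <property name="args.input" location="maps/user-guide.ditamap"/>
        <property name="transtype" value="xhtml"/>
        <property name="output.dir" location="${basedir}/out/html"/>
      </ant>
    </target>
  </project>

The job itself then reduces to: trigger on a commit to the repository, run this target, and archive or publish whatever lands in the output directory.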

My experience using CloudBees to start setting up a test environment for DITA for Publishers was particularly positive. It literally took me minutes to set up a CloudBees account, provision a Jenkins server, and set up a job that would be able to run the OT. The only thing I needed to do to the Jenkins server was install the multiple-source-code-management plugin, which just means finding it in the Jenkins plugin manager and pushing the "install" button. I had to set up a github repository to hold my configured Open Toolkit but that also just took a few minutes. Barring somebody setting up a pre-configured service that does exactly this, it's hard to see how it could be much easier.

I think that Jenkins + git coupled with low-cost cloud services like CloudBees really changes the equation, making things that would otherwise be sufficiently difficult as to put off implementation easy enough that any small team should be able to do it well within the scope of resources and time they have.

This shouldn't worry CMS vendors--it can only help to grow the DITA market and help foster more DITA projects that are quickly successful, setting the stage for those teams to upgrade to more robust commercial systems as their requirements and experience grow. Demonstrating success quickly is essential to the success of any project and definitely for DITA projects undertaken by small Tech Doc teams who always have to struggle for budget and staff due to the nature of Tech Doc as a cost center. But a demonstrated DITA success can help to make the value of high-quality information clearer to the enterprise, creating opportunities to get more support to do more, which requires more sophisticated tools.
