Dr. Macro's XML Rants: XCMTDMW: Import is Everything, Part 3

I hope I'm starting to get the point across that import, the act of crossing the boundary between outside and inside the repository, is where everything really happens. Because if I'm not making that point something is really wrong.

Before we continue exploring the import and access use cases started in Part 2, let's talk about schema-specificity for a moment, because I want to be careful I'm not painting too rosy a picture with all my talk about generic XML processing.

One issue with managing XML documents is the sensitivity of the management system to the details of the schemas. In the worst case the low-level repository schema directly reflects the document schema, so that any change to the document schema requires a change to the repository schema, which in turn can require an export and re-import of all the data in the repository, a dangerous and disruptive thing to have to do.

That's clearly crazy, and any system that has that implication is so inappropriately overoptimized that it makes me crazy to even think about it.

We've also seen that a completely generic system for importing XML, while useful, isn't nearly complete enough to support the needs of local business processes and business rules.

In yesterday's entry, Import is Everything, Part 2, we had just gotten to the point where we were creating, on import, some storage object metadata properties that were specific to our local policies, such as the "is schema valid" property, in the sense that we needed those properties in order to implement our policies and the business processes or user actions they implied. But those properties are still generic with respect to the document's schemas. An XML document is either schema valid or it isn't, regardless of the schema.

Because we were operating on just the XSD-defined links (schemaLocation=), some of our import processing was schema-specific, but specific to a completely standard schema (XSD), not to our local schema.

But we're about to explore some use cases where we do need local schema awareness and we'll start to see where that awareness resides in the code. The short answer is, it resides in the import processing, top-of-Layer 2, and Layer 3 components. None of these should require a complete export and import of the documents involved should the schema change, although they might require reprocessing some or all of the documents in the repository (but directly from the repository).

It should be pretty clear by now that any extraction of metadata or recognition of dependency relationships that is schema-specific will of course happen in schema-specific import code. That's why the XIRUSS-T importer framework is designed the way it is, because you always have to write at least a little bit of code that is unique to your schemas and your business processes so why not make writing that code as easy as possible?
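To make the idea concrete, here is a minimal sketch of a generic importer with a single schema-specific hook. It is not the XIRUSS-T API: the class names, the `extract_schema_specific` method, and the `<manual>`, `<title>`, and `<xref href="...">` elements are all hypothetical, invented only to show how little of the code needs to know about the local schema.

```python
import xml.etree.ElementTree as ET


class GenericXmlImporter:
    """Hypothetical generic importer: does only schema-independent work."""

    def import_document(self, name, xml_text):
        root = ET.fromstring(xml_text)
        # Generic storage-object properties: true for any XML document.
        props = {"name": name, "root_element": root.tag}
        dependencies = []
        # Hook for schema-specific subclasses; a no-op by default.
        self.extract_schema_specific(root, props, dependencies)
        return {
            "content": xml_text,
            "properties": props,
            "dependencies": dependencies,
        }

    def extract_schema_specific(self, root, props, dependencies):
        pass


class ManualImporter(GenericXmlImporter):
    """Knows a made-up local schema: <manual>, <title>, <xref href=...>."""

    def extract_schema_specific(self, root, props, dependencies):
        title = root.find("title")
        if title is not None:
            props["title"] = (title.text or "").strip()
        # Each local cross-reference becomes a dependency record.
        for xref in root.iter("xref"):
            href = xref.get("href")
            if href:
                dependencies.append(
                    {"type": "cross-reference", "target": href}
                )
```

In this sketch, everything unique to the local schema lives in one overridden method, so a schema change touches `ManualImporter` and nothing else.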

By "top-of-Layer 2" I mean code that does semantic processing of the elements inside the documents, such as link management, that sits on top of the generic facilities in Layer 2 but that may be schema specific, for whatever reason (usually optimization necessary to achieve appropriate scale or performance). For example, any full-text or element metadata index is a Layer 2 component. You can implement a completely generic, schema-independent indexing mechanism but for non-trivial document volumes and/or sophisticated schemas you will very likely want to tailor the index both to avoid indexing things you're unlikely to ever search for and to index things in a way that is more abstract than the raw XML syntax (I'll talk about these in more detail when I get around to full-text indexing of XML as a primary topic). To implement these specializations you'll need to tailor the indexer and possibly the UI for using the index in ways that are schema specific. No getting around it.
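As a rough illustration of the two kinds of tailoring just described, the sketch below builds a toy word-to-element index. Called with no options it is fully generic; passing a skip set and a name mapping makes it schema specific. The function `build_index`, the element names (`revision-history`, `b`, `i`, `em`), and the abstract name `emphasis` are assumptions made up for this example, and a real full-text indexer would be far more involved.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Assumptions about a made-up local schema:
SKIP_ELEMENTS = {"revision-history"}  # content nobody searches for
ABSTRACT_NAMES = {"b": "emphasis", "i": "emphasis", "em": "emphasis"}


def build_index(xml_text, skip=frozenset(), abstract=None):
    """Map each lowercased word to the set of element names containing it.

    With the defaults this is schema independent. `skip` prunes whole
    subtrees; `abstract` replaces raw tag names with more abstract ones.
    """
    abstract = abstract or {}
    index = defaultdict(set)

    def walk(elem):
        if elem.tag in skip:
            return  # do not index this element or anything inside it
        key = abstract.get(elem.tag, elem.tag)
        for word in (elem.text or "").lower().split():
            index[word].add(key)
        for child in elem:
            walk(child)
            # Text after a child's end tag belongs to the parent element.
            for word in (child.tail or "").lower().split():
                index[word].add(key)

    walk(ET.fromstring(xml_text))
    return dict(index)
```

The generic call is `build_index(doc)`; the tailored one is `build_index(doc, SKIP_ELEMENTS, ABSTRACT_NAMES)`. Only the two module-level constants know anything about the schema.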

Likewise Layer 3 is where you implement functionality that reflects specific business processes and policies, which means processes that act on the XML in the repository as well as on the storage-object and element metadata in order to do useful stuff. Much of this functionality will be schema specific to one degree or another (but not all of it, of course).

So unless you can get by with a very generic system that only implements support for standards, you will always have to create and maintain system components that are schema specific. However, there are some important characteristics of a system architected as I've outlined here:

- The core storage object repository, Layer 1, is never schema specific. This means that the importers and Layers 2 and 3 can change without ever affecting the storage objects managed in Layer 1. In particular, you will never require an export and re-import if Layers 2 or 3 change.

- The code most sensitive to the schema details is closest to the edges of the repository and, in most cases, builds on more generic facilities. This has two advantages: the amount of code that is actually schema specific is minimized and the disruptive potential of changing that code is minimized.

- You get to choose, as a matter of policy and implementation, the degree of schema specificity that is appropriate for a given feature. You can choose whether your full-text index is generic or tailored, the degree to which you reflect the semantics of your link types in the dependency objects created from them, and so on. So you can start small and work up as both your understanding of your business processes improves and as your schemas become more stable (assuming you're starting from scratch with brand-new schemas).
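The first of these characteristics can be sketched in a few lines. In the toy model below, the Layer 1 stand-in stores opaque bytes and never parses them, while schema-specific data is derived by reading straight from the repository. Everything here is hypothetical (the `StorageRepository` class, the `reprocess` function, the `<title>` and `<meta>` elements); the point is only that swapping the extractor after a schema change leaves the stored objects untouched.

```python
import xml.etree.ElementTree as ET


class StorageRepository:
    """Layer 1 stand-in: opaque bytes plus generic properties, no schema."""

    def __init__(self):
        self._objects = {}

    def put(self, object_id, data, properties=None):
        self._objects[object_id] = (bytes(data), dict(properties or {}))

    def get(self, object_id):
        return self._objects[object_id][0]

    def ids(self):
        return list(self._objects)


def reprocess(repo, extractor):
    """Rebuild derived, schema-specific data directly from the repository.

    Nothing is exported or re-imported; stored objects are only read.
    """
    return {
        oid: extractor(ET.fromstring(repo.get(oid))) for oid in repo.ids()
    }


def title_v1(root):
    # Original made-up schema: <title> is a direct child of the root.
    return root.findtext("title")


def title_v2(root):
    # After a hypothetical schema change, titles may live under <meta>.
    return root.findtext("meta/title") or root.findtext("title")
```

When the schema changes, you would run `reprocess(repo, title_v2)` in place of `reprocess(repo, title_v1)`; the bytes held by `StorageRepository` are the same before and after.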

Regardless of how it's architected or implemented, most of the ongoing maintenance and operating cost of an XML-aware CMS comes from reaction to changes in the schemas of the documents managed. The only question is: does the CMS design and implementation minimize that cost or does it maximize it?

Also, when you start planning for the creation and deployment of an XML-aware CMS you need to define your overall requirements such that you can clearly distinguish those requirements that are schema-specific or schema-sensitive from those that are not. For example, a requirement to impose a basic workflow onto documents is probably not schema specific but a requirement to manage a particular kind of link that is not defined in terms of any standard is schema specific.

By separating the requirements in this way you can both better estimate the immediate and long-term costs of supporting those requirements and help the implementors keep the code that is schema independent more clearly separated from the code that is schema sensitive. This will go a long way toward making your system much less expensive to maintain in the long run and much more flexible in the face of new requirements, whether they are new business processes or new schema features.

Next: More linking and stuff

Labels: XCMTDMW "xml content management" import

1 Comments:

Anonymous said...: This post is a good summary to the benefits of architecting the CMS for an XML-aware system. One of the things I struggle with is being able to convince "key stakeholders" of these benefits in terms that they understand and can digest. From an engineering perspective, I understand much of this (and you've clarified past mental struggles rather nicely, btw). From a business perspective, I guess you have to be in a good enough position or have the communication skills to convey the information to those who hold the money.; 3:58 PM


Friday, July 28, 2006
