Dr. Macro's XML Rants

NOTE TO TOOL OWNERS: In this blog I will occasionally make statements about products that you will take exception to. My intent is always to be factual and accurate. If I have made a statement that you consider to be incorrect or inaccurate, please bring it to my attention and, once I have verified my error, I will post the appropriate correction.

And before you get too exercised, please read the post, dated 9 Feb 2006, titled "All Tools Suck".

Saturday, August 26, 2006

XIRUSS-T Update: Now With Direct HTTP Access to Content

I have uploaded a new release of XIRUSS-T to Sourceforge, xiruss_t_build_20060826.

The code now includes two HTTP servers, the API server and the "viewer" server. I've provided a new top-level server runner that starts both servers. I've added some convenience functions to the Jython xiruss_client.py script to make manipulating the repository contents a little easier.
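
For the record, the combined runner doesn't do anything clever--it just starts both servers against the same repository instance. Roughly, and treating the viewer runner class name and the constructor signatures as illustrative assumptions rather than the actual code, it amounts to something like this:

// Sketch only: JettyXirussHttpApiRunner and XirussRepositoryDefaultImpl are
// real names from the code base; the viewer runner class and its arguments
// are hypothetical stand-ins.
public class CombinedServerRunner {
    public static void main(String[] args) throws Exception {
        XirussRepository rep = new XirussRepositoryDefaultImpl();
        // The API server exposes the REST API used by the client library
        // and the Jython xiruss_client.py script.
        JettyXirussHttpApiRunner apiRunner = new JettyXirussHttpApiRunner(rep);
        apiRunner.start();
        // The "viewer" server serves version content directly over HTTP
        // (port 9091 by default).
        XirussHttpViewerRunner viewerRunner = new XirussHttpViewerRunner(rep, 9091);
        viewerRunner.start();
    }
}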

This release adds support for direct HTTP-based access to version content: you can now import XML documents and then access them via URL from any HTTP-aware tool (e.g., a Web browser, an XSLT processor, an editor like oXygen XML), and references to other documents will be resolved against the repository correctly.
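
To make "HTTP-aware tool" concrete: any stock XML API that accepts a URL can consume a version straight out of the repository, with no XIRUSS-specific code on the consuming side. A minimal sketch (the version URL below is made up--in practice you'd copy it out of the viewer server's snapshot view):

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

public class FetchVersionExample {
    public static void main(String[] args) throws Exception {
        // Hypothetical version URL served by the XIRUSS viewer server.
        String versionUrl = "http://localhost:9091/versions/some-version-id";
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true);
        DocumentBuilder db = dbf.newDocumentBuilder();
        // The parser fetches the content over HTTP; any references the
        // importer rewrote are resolved back against the repository.
        Document doc = db.parse(versionUrl);
        System.out.println("Root element: " + doc.getDocumentElement().getTagName());
    }
}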

For example, using the Jython client as a helper, I imported a directory containing an XML document that uses a schema that imports yet another schema (which is in turn made up of many small parts). The whole lot gets imported via the built-in directory importer. Having done that, I then navigated via the new HTTP server (started on port 9091 by default) to the version that is the imported XML document. Clicking the link from the snapshot view to the version gives you the content of the version in the browser window. There you can see that the pointers to, for example, schemas have been rewritten as references to resource IDs with resolution policy names as URL parameters.

I then copied the URL of the version to the clipboard, opened the oXygen XML editor, and did "Open URL". I pasted the copied URL into the box and opened it. The version opened in the editor. I then pushed the "Validate" button and the document was validated against its schema, accessed directly from the repository.

Not earth-shattering functionality, but it's a big milestone for XIRUSS.

The HTTP viewer server is very crude and unsophisticated--I'm not a Web guy and have not put any real effort into making it look pretty--it's just a way to demonstrate accessibility of versions.

This is the minimal functionality needed to make the XML versions stored in a XIRUSS repository directly usable without any sort of explicit export action.

Note that being able to actually edit a version through a tool like oXygenXML would require either implementing the necessary WebDAV protocols or providing a plug-in that works via the XIRUSS client API. The last time I looked into implementing WebDAV it appeared to be harder than I expected so I didn't do it. At the time I couldn't find a nice layered WebDAV implementation that would have been quick to adapt to my stuff. That might be different now, I don't know.

Finally, I think my approach to the rewritten URLs needs to be thought through carefully. The current approach works, but it binds the resolution policy into the version content, and I think that is wrong. You should be able to change the resolution policy for a dependency without modifying the version (the current code reflects my initial implementation from a couple of years ago). I think the right thing to do is to point to dependency objects, but that imposes some requirements for dependency existence that the repository model currently doesn't impose. So I have to think it through. But what I'm doing now is definitely not 100% correct.

Toward that end I think my next task will be to implement an Eclipse plug-in that provides more sophisticated access to the repository and enables direct editing of new versions through Eclipse-integrated editors. I don't think this will be too hard--certainly no harder than my crude HTTP stuff was--and it will provide a much nicer interface overall.


Thursday, August 24, 2006

Office Open XML: Good or Evil?

In response to a prospect's alleged comment that "Word will save tables as CALS tables" (or something to that effect--I got the comment second or third hand) I downloaded the Office 2007 Beta and started looking into the whole Office Open XML thing.

First, as far as I can tell from a little hands-on testing and Google searches, there's no built-in support for CALS or OASIS Exchange tables in Office 2007, at least in the beta. The table markup in Office Open XML is definitely the same Word ML stuff from Office 2003.

I also read some of the commentary on both sides of this issue and found it amusing. It's amusing because it's just so typical of everyone involved.

My feelings about Microsoft as an enterprise are no secret, but I'll outline them here:

- As a true blue legacy IBMer I was raised with a built-in hatred of Microsoft (I lived through the whole Windows vs. OS/2 times, having started at IBM about the time the IBM XT was released). I try not to let this youthful indoctrination color my objective analyses too much.

- I feel strongly that enterprises should compete on value not proprietary lock-in and therefore have many objections to Microsoft's core business practices. This is particularly frustrating to me because of my next opinion.

- Microsoft has lots of smart people who can and do create excellent software. That is, Microsoft is more than capable of competing on value alone, at least now that it has established market dominance. Of course there is the issue of free vs. licensed software, which throws a wrinkle into this equation--if OpenOffice is free, how does Microsoft get legitimate revenue in a value-only competition? They would have to offer enough extra value to make it worth paying for. In fact they probably do, but it would be a leap of faith for them to go that route (although Office Open XML may in fact represent an unavoidable move in that direction anyway; see below).

- Microsoft also has lots of people, smart or not, who make totally boneheaded design and implementation decisions that then get baked into products forever. I'm thinking specifically of the fact that Word has not been able to manage the auto-numbering of nested numbered lists since Version 2 (and maybe not before then). Some of this is just people not thinking things through, as always happens in software development, but I think a lot of it is a corporate culture of "get it out quick, we'll fix it in the next release"--that is, not valuing engineering quality quite as much as I think they should (which really means caring more about maximizing revenue than about providing the best possible solutions to customers--which, if you're a stockholder, is a good thing but, if you're a user, is a bad thing [that being one of the essential problems with Capitalism as an economic system]).

To a large degree this makes Microsoft no different from most software companies. The difference of course is Microsoft's monopoly position in both operating systems and office software--it paints a big target on their backs. But Microsoft isn't doing anything that IBM didn't do for 20 years before the PC came out.

I used to rant about how evil MS Office (and in particular MS Word) was as a proprietary format--it locked your data into a format you didn't own and over which you had no control. That was definitely bad and anyone who accepted that agreement was a dupe and fool. This led of course to discussions of why (at the time) SGML was A Better Way. And what tool were the slides for those presentations done in almost without exception? Of course it was PowerPoint. [I did try on occasion to hack my own SGML-based presentation systems but I never had the time or tools to make it really work and I had to be able to interoperate with my less-enlightened colleagues.]

I must also confess that after years of resisting I got an XBox and actually subscribe to Official XBox Magazine (and Lego Star Wars II is going to ROCK). So clearly when they want to do it right Microsoft can: they're big, they've got lots of talent at their disposal. In short, they can choose to do things however they want to.

And I'll just add that for all the rantings I've spewed about Bill Gates and his evil business practices, the Bill and Melinda Gates Foundation demonstrates that he's actually got a heart and is actively trying to do serious good for the world, so full props to Mr. Bill for putting his billions to use.

Oh, and I hate MS Word with the fiery passion of a thousand burning suns. I'd sooner chew off my own arm than spend any time actually authoring words in Word. I've spent so many years authoring XML that having to deal with $*%&# like a backspace at the end of a paragraph destroying its formatting with no good way to get it back, or the complete inability to do autonumbering, and any number of other just stupid things that people tolerate day after day for reasons I can't understand--not to mention the egregious waste of productivity that I've observed in my own XML-steeped colleagues who are literally sitting next to me--just makes me want to SCREAM. But that's just me.

So what about Office 2007 and Office Open XML?

I'm not going to bother to form a technical opinion about the relative merits of, for example, ODF and OOX because it just doesn't matter. I mean really. At the end of the day the people who create Word documents (poor bastards) or spreadsheets or presentations are the ones who care, and they only care about whether they can get the work done reasonably quickly and whether it looks right. They don't care about formats or XML data islands or how metadata is stored relative to the core content. They also don't, by and large, care about interoperation, because everybody uses Word, don't they?

Microsoft has consistently demonstrated that their policy is to use standards only when it suits their interests. They were dragged into XML kicking and screaming (despite being founding members of the XML Working Group) because they knew it would be a chink in their proprietary armor that would allow wedges to be driven in. But then XML took hold and they had no choice, so they embraced it, which is to their credit. That they embraced it by just XMLifying RTF is no surprise, but at least they did it. And they documented it, something they never did completely with RTF (I'm sure there are those of you who remember when alternating versions of Word would fail, in different ways, to parse RTF that was valid per the RTF spec).

And Office 2003 even let you edit XML documents in arbitrary schemas (as long as they were in a namespace and defined using XSD schemas, a decision which is too strict but since it's my preferred policy for XML usage generally I can't really fault them). Of course this feature is largely useless for lots of reasons but hey they did it, so good for them. It demonstrated that they weren't just giving lip service to XML--they took the trouble to design and build a working arbitrary XML editor. [Now if they would just make it useful I would be happy.]

But it's hard not to see Office Open XML as a cynical attempt to satisfy the European Union and fight OpenOffice in the standards arena. All the arguments about "backward compatibility" and "we have to support all the features" are really not germane: if they really cared about there being a single universal standard for office documents they would have started with ODF and gone from there, since it already existed and is certainly close enough to what they need to be a starting point. They could have chosen to eat the cost of using MathML instead of their own math presentation markup. They could have chosen to use SVG instead of their own vector graphics language. It would have cost more, both in development time and application migration, but they could have easily said "As a company we are fully committed to open standards and are willing to do what it takes to make it work." But they didn't, for whatever reason. This saddens me a little, because there was an opportunity here that would probably have had some real benefit, but it doesn't surprise me at all (in fact, if they had done it, that would have surprised me).

I don't think any of this will materially change the day-to-day situations of people who use office software (whether MS Office or OpenOffice).

I do think it's a good thing that Office 2007 now stores its data exclusively in XML by default, and I think the use of Zip files to organize the different parts, which are stored as individual documents, is the right thing to do. I applaud Microsoft for that decision.

And even though the ECMA standardization of Office Open XML is driven by cynical business motives, it's still a standard, which means that it is truly open (in the sense that there is no license cost or exposure for using the format or implementing support for it), and that will be to our benefit. I suspect that it will have the same effect that using XML did: it will force Microsoft to compete more on value than on lock-in, to engineer things a bit more carefully, and to be more consistent in their implementations from release to release.

For integrators it definitely makes it easier for us to connect things to Office (e.g., creating an X-to-OOX transform or adapter) with some assurance that the code we write today will still work five years from now.

So while I think it's pretty clear that Office Open XML was driven almost entirely by self-serving business needs, I can't see how it's a bad thing in general, and it looks like it's actually a good thing if you recognize the reality that most office documents are in fact created in MS Office.

Now as for the new user interface--that's going to take some getting used to, but since I don't use Word it doesn't really matter to me, does it?


Tuesday, August 22, 2006

MarkLogic: Integrated with Xiruss in No Time At All

For a project at work I've started evaluating the MarkLogic XML search engine. So far I'm pretty impressed (although I certainly haven't stressed the software beyond just getting it running and writing a little code against it). The installation and setup is pretty straightforward. The provided user interfaces are solid and usable. The documentation is clear and informative. The Java API is logical and small. Reports from colleagues who have used it more heavily are that it is very fast. [Disclaimer: Innodata Isogen is a MarkLogic partner (as far as I know), as we are partners with almost every product vendor in the XML space. This is the first time I've personally done anything with MarkLogic. My other experience with XML-aware search and retrieval tools was over six years ago, when we beat our heads against the version of Verity that was at that time integrated with Documentum, which had some serious problems, including the inability to index and retrieve elements with "." in their tag names. So I can't claim to have any basis for comparison of MarkLogic to other similar tools--I'm simply reporting my impressions of MarkLogic. This is also my first real use of XQuery for anything.]

One of the things I like about MarkLogic is that it is focused on a specific task, indexing and retrieving XML using XQuery. It doesn't also try to be a content management server or anything and MarkLogic seems to be clear about that, which is good.

After reading the Java API documentation I realized that it would be trivial to integrate a MarkLogic server with XIRUSS-T, which I did this evening in about an hour (of which 30 minutes was spent working out how my own code worked, 20 minutes was spent figuring out the sequence of MarkLogic API calls to make, which I did using Jython connected to both a running XIRUSS-T server and a running MarkLogic server, and about 10 minutes was spent coding up my integration code).

This is pretty impressive to me because it indicates that the MarkLogic system is solid and easy to integrate (at least for the simple thing I did, but I don't see any great potential complexities other than careful error handling in what I've seen so far). It certainly passed the first gates of being easy to get running, easy to figure out how to do something useful with, and easy to write custom code against its API. A lot of tools don't pass the first gate, and even fewer pass the second or third.

My integration with XIRUSS is very simple and not anywhere near as complete as you'd really want in a fully-realized system, but it's sufficient to allow me to throw an XQuery at any document stored in the XIRUSS repository.

To do the integration I wrote a simple XIRUSS StorageManager that is just a wrapper over any other storage manager. This wrapper creates a MarkLogic-specific StorageObjectData instance that is itself a wrapper over any other StorageObjectData implementation (which in turn manages the actual storage and access to the data content of a StorageObject version).

The MarkLogicStorageManager is constructed with the URI of a MarkLogic XDBC server (which provides remote access to a MarkLogic server via a simple Java API) and holds the top-level server access object. It is constructed with a real storage manager and just delegates to it for all the methods except setStorageObjectData(), in which it constructs a new MarkLogicStorageObjectData instance (that in turn wraps the StorageObjectData instance created by the underlying real storage manager). This approach lets you use the MarkLogic storage manager with any particular way of storing the data. The alternative would have been to directly subclass InMemoryStorageManager or FileStorageManager, but since MarkLogic doesn't care where the data is stored in the repository, the wrapper approach seems more appropriate.
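
In outline, and with the XIRUSS interfaces pared down to the one method that matters here (so treat this as a sketch of the delegation pattern rather than the actual class--the method signature and the MarkLogicStorageObjectData constructor arguments shown are assumptions), the wrapper looks something like this:

// Sketch only: the real StorageManager interface has more methods, all of
// which simply delegate to the wrapped storage manager in the same way.
public class MarkLogicStorageManager implements StorageManager {

    private final StorageManager realStorageManager; // does the actual storing
    private final ContentSource contentSource;       // MarkLogic XCC access point

    public MarkLogicStorageManager(XirussRepository rep,
                                   StorageManager realStorageManager,
                                   URI xdbcServerUri) throws Exception {
        this.realStorageManager = realStorageManager;
        // ContentSourceFactory is the XCC entry point for an XDBC server URI.
        this.contentSource = ContentSourceFactory.newContentSource(xdbcServerUri);
    }

    public StorageObjectData setStorageObjectData(StorageObject so) {
        // Let the real storage manager decide where the bytes live, then wrap
        // the result so that close() also pushes the content into MarkLogic.
        StorageObjectData realData = realStorageManager.setStorageObjectData(so);
        return new MarkLogicStorageObjectData(realData, contentSource);
    }

    // ...all other StorageManager methods delegate to realStorageManager...
}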

The MarkLogicStorageObjectData class adds to the "close()" method of StorageObjectData (called when you're done writing to a mutable version's content) the logic to get a new MarkLogic session, create a "content" object (which indexes the storage object's content), and insert that content object into the MarkLogic repository, named in a way that maps directly to the Version object as stored in the XIRUSS repository:

public void close() throws IOException {
    this.data.close();
    // Now write the data to the MarkLogic repository.
    StorageObject so = this.getStorageObject();
    String mlUrl = HttpApiUrlConstants.VERSIONS + "/" + so.getId();
    Session mlSession = this.contentSource.newSession();
    ContentCreateOptions options = null;
    if (so instanceof XmlStorageObject) {
        options = ContentCreateOptions.newXmlInstance();
    } else if (so instanceof TextStorageObject) {
        options = ContentCreateOptions.newTextInstance();
    } else {
        options = ContentCreateOptions.newBinaryInstance();
    }
    logger.debug("Creating MarkLogic content object for version " + so.getId() + ": " + so.getName());
    Content content = ContentFactory.newContent(mlUrl, this.getInputStream(), options);
    logger.debug("Content object created");
    try {
        logger.debug("Inserting content into MarkLogic server...");
        mlSession.insertContent(content);
        logger.debug("Content inserted");
    } catch (RequestException e) {
        logger.error(e);
        throw new IOException("Exception putting content into MarkLogic server: " + e.getMessage());
    }
}


I then created a subclass of JettyXirussHttpApiRunner that does nothing more than create a new MarkLogicStorageManager and set it as the default storage manager for the repository:

XirussRepository rep = new XirussRepositoryDefaultImpl();
URI mlURI = new URI("xcc://admin:admin@localhost:8010/Documents");
StorageManager sm = new MarkLogicStorageManager(rep, rep.getDefaultStorageManager(), mlURI);
rep.addStorageManager(sm);
rep.setDefaultStorageManager(sm.getId());
rep.setPort(port);
MarkLogicXirussHttpApiRunner runner = new MarkLogicXirussHttpApiRunner(rep);
runner.start();


That's all there was to it. I then used my little Python XIRUSS client to import an XML document and hey presto, from the MarkLogic sample query UI (a Web page that just lets you submit arbitrary XQueries and see the results) I could query against the document I just imported.
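
The same sort of query can, of course, be thrown at the server from code rather than from the sample UI. A minimal XCC example (the connection URI and the query itself are just placeholders; use whatever XDBC server you actually configured):

import java.net.URI;
import com.marklogic.xcc.ContentSource;
import com.marklogic.xcc.ContentSourceFactory;
import com.marklogic.xcc.Request;
import com.marklogic.xcc.ResultSequence;
import com.marklogic.xcc.Session;

public class SimpleXQueryExample {
    public static void main(String[] args) throws Exception {
        // Placeholder connection URI matching the XDBC server configuration.
        ContentSource cs = ContentSourceFactory.newContentSource(
                new URI("xcc://admin:admin@localhost:8010/Documents"));
        Session session = cs.newSession();
        // Ad hoc XQuery listing the URIs of everything in the database,
        // including documents inserted by the MarkLogic storage manager.
        Request request = session.newAdhocQuery(
                "for $d in doc() return xdmp:node-uri($d)");
        ResultSequence rs = session.submitRequest(request);
        System.out.println(rs.asString());
        session.close();
    }
}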

I think this exercise certainly validates the XIRUSS design to some degree (the fact that it was that easy to bind it to MarkLogic by using the defined extension points).

To make this integration more complete I'd want to do things like reflect XIRUSS-maintained Version properties in the MarkLogic repository as appropriate (MarkLogic has the concept of arbitrary properties associated with indexed documents), provide some association between branch and snapshot visibility in XIRUSS and the equivalent security settings in MarkLogic (i.e., when you query the MarkLogic database you can only see results for Versions that are visible in your current branch and snapshot context), and integrate XIRUSS's schema registry with MarkLogic's schema awareness (needed to do schema-type-aware XQueries and validation through MarkLogic's built-in processing support). There's also some schema-specific configuration of MarkLogic that you may need to do (such as fragmentation points in documents so it can handle large document instances).

But I've certainly proven to myself that the minimal useful integration is not at all hard.

Also, MarkLogic offers both time-limited evaluation versions and a size-limited "community" version that is an ideal companion to XIRUSS-T (as a toy system).

The MarkLogic storage manager code is in the XIRUSS-T Subversion repository on SourceForge--I created it as a separate Eclipse project from the main xiruss-t code base, so it's in trunk/marklogic_storage_manager.


Sunday, August 20, 2006

XIRUSS-T Update: New Release On SourceForge

I have finally gotten the client/server code sufficiently complete and functional to make it worth formally releasing: Xiruss-t-build_20060820

This code provides two jars, xiruss-t-client.jar and xiruss-t-server.jar, as well as all the source code (including unit tests). I've also included a very simple but very handy Python script (python/xiruss_client.py) for use with Jython that makes it easy to import a file into the running repository. See the release notes for the release on SourceForge for details.

I still need to create a little GUI for importing files and directories and for navigating the repository, as well as restore the currently broken Web-based end-user interface.

But what's released should be sufficient for anyone who is interested in looking under the hood to easily play around with a working system. And by "working" I mean "all the unit tests pass but beyond that I make no guarantees and a number of client-side methods are not yet implemented".

So I'm going to put this code down for a while (or at least not work on it quite so obsessively--my wife has been starting to give me rather dirty looks the last few days) and return to the main discussion of XML content management.

And here's a little side question: does anyone know of a quick way to translate a bunch of POJO code into the equivalent Python? I have a Web site for xiruss.org but my cut-rate hosting service only supports Python and Perl [And I'd eat a gun before I'd ever write another line of Perl]. I'd like to set up a demonstration server that people can use to put their own stuff into but I'd need a Python implementation of the server. My quick research suggests there's no such animal. I know I could do most of it with either reflection or just search and replace but I thought maybe somebody out there would have some ideas.


Friday, August 18, 2006

XIRUSS-T Update: Can store and retrieve XML compound documents

I have finally completed the minimal implementation of support for importing and getting back XML compound documents. This means that from the client you can use the provided XML importer to import an XML compound document (any document using XInclude, an XSD schema, or an XSLT style sheet) and then, on the client, request the imported versions and get a DOM from them.

This is the core functionality needed to make XIRUSS-T a useful XML-aware content management system.

I still have more testing to do and more client-side methods to implement but the system is now minimally usable for realistic XML management use cases.

My next tasks, in addition to further testing and method implementation, are to get things packaged up nicely, hack up a little client GUI, and start documenting the API and code design in more detail.


Thursday, August 10, 2006

XIRUSS-T Update: Can Write To Storage Object Via HTTP API

I've been working feverishly on getting the XIRUSS HTTP API implemented. It's been more work than I anticipated, mostly because doing the API has revealed a number of weaknesses in my original code (not surprising, since it was hacked at top speed). Extracting interfaces took more time than I thought (Eclipse didn't do everything it should have--not sure if that is a limitation or user error). I also reorganized the code packages to make the distinction between client and server code components clearer and to make a cleaner distinction between core implementations and repository-specific code. Finally, I had to seriously rework my storage manager implementation. But now I have all that in place, and I just got the test case working that demonstrates that I can create a StorageObject version, put data into it, and get it back out via the HTTP API. This is a major milestone. Now all I have to do is refactor the existing Importer code to use the new API and code patterns and implement any remaining client-side proxy methods that the importers require, and it should all just work. Once I get that done I can get back to the discussion of versioned hyperdocument lifecycle management.


Saturday, August 05, 2006

XIRUSS-T Update: Client Almost Done

I have achieved a milestone in my HTTP client implementation: my client test case demonstrates that you can create new versions, commit them to a branch, and get them back again via the newly-created snapshot. This is the core functionality needed to get things into the repository and get them back out. I also demonstrate support for all the methods on RepositoryObject (the base class for all objects managed by the repository). This code is committed to the SourceForge Subversion repository.

Still to do:

- Implement all the remaining methods on Version

- Implement writing to storage object versions via the API client helper

- Figure out the best model for getting a new repository and session on the client side (this is an API design question not a functionality question).

- Refactor the organization of the various interfaces and classes to clearly separate the stuff that is only relevant to servers from the stuff needed by clients, so that the client-side library can be as small as possible. This will also involve refactoring the abstraction layers from the core repository up through the Xiruss-specific HTTP server. There needs to be a clear abstraction layer that adds in user and session awareness. I feel strongly that the core repository data model should be completely generic so that it can be exposed as essentially a single-user process. Issues of multi-user support, including authentication and so forth, are implementation issues that need to be able to vary among implementations. In addition, things like supporting multiple users or ensuring transaction safety are performance and scalability issues that I am explicitly not addressing in XIRUSS-T. These are things that could be addressed on top of the base code using aspects, in how the server-specific objects are implemented, or at the core SnapCM object implementation level. But none of that is needed in order to provide a semantically correct distributed server, and ignoring those issues here keeps the code very, very simple, which is my goal.

I've been thinking about it and I realized that with XIRUSS-T I don't want to impress anyone with the dazzling complexity of my code but with the breathtaking simplicity of the underlying data model and the core implementation objects. My whole point is that these complex challenges of managing versioned XML compound documents can be met through fundamentally simple tools used in clever ways.


Wednesday, August 02, 2006

XIRUSS-T Update: Client Starting To Take Shape

In the unlikely event that there's somebody out there waiting breathlessly for me to continue my exploration of versioned hyperdocument management, I wanted to report on why I haven't posted in the last couple of days.

I've been working full out on implementing a usable HTTP-based REST server API and corresponding client layer for XIRUSS-T. This required me to extract interfaces for all the core SnapCM classes (something I should have done from the start, but that's test-driven development for you--until I started on the client there was no need for interfaces because there was only one implementation of each class). This also required that I refine and fix the implementation of some core SnapCM semantics. Needless to say this was an involved refactor. Thank goodness for reasonably complete unit tests, that's all I've got to say.

At the moment, the XIRUSS-T code in the Subversion repository on SourceForge has a client API layer and a corresponding unit test that can connect to a running XIRUSS-T server over HTTP, create a user, get a session for that user, then ask for the user's session again and verify that it gets the same session back. The client provides proxy objects that reflect the XIRUSS abstract API (thus the need for the interfaces).

This may sound simple but it was a lot of work to get to this point. Now it's pretty much just a matter of typing to get all the client-side classes and methods implemented.

Once I have the client in place then it will be easy to create scripted or graphic clients to do stuff like navigate the repository, manage imports and exports, and so on. It will also make it easy to implement Layer 3 components as distributed clients, which is the most general thing to do even if they are running on the same machine as the server.

I really like the REST approach (using normal HTTP protocols and returning the result as XML chunks). I could have used something like RMI but that felt harder, even though it's probably actually less work to implement. But there's something very comforting about being able to point a browser at the server and see the XML response right there in the browser. Once you know the URL construction rules for the API you can navigate around manually. In the case of XIRUSS you can eventually navigate to the data content of a storage object version and see it in the browser.

It also means that code in any language can connect to the server--no need to somehow provide different language bindings (or be limited to only Java clients).

So I haven't had any time to write my next post in the XCMTDMW series. But I figure most people who are interested probably need some time to catch up to me anyway....
