MarkLogic: Integrated with Xiruss in No Time At All
For a project at work I've started evaluating the MarkLogic XML search engine. So far I'm pretty impressed (although I certainly haven't stressed the software beyond just getting it running and writing a little code against it). The installation and setup are pretty straightforward. The provided user interfaces are solid and usable. The documentation is clear and informative. The Java API is logical and small. Reports from colleagues who have used it more heavily are that it is very fast. [Disclaimer: Innodata Isogen is a MarkLogic partner (as far as I know), as we are partners with almost every product vendor in the XML space. This is the first time I've personally done anything with MarkLogic. My other experience with XML-aware search and retrieval tools was over six years ago, when we beat our heads against the version of Verity that was at that time integrated with Documentum and that had some serious problems, including the inability to index and retrieve elements with "." in their tag names. So I can't claim to have any basis for comparison of MarkLogic to other similar tools--I'm simply reporting on my impressions of MarkLogic. This is also my first real use of XQuery for anything.]
One of the things I like about MarkLogic is that it is focused on a specific task, indexing and retrieving XML using XQuery. It doesn't also try to be a content management server or anything and MarkLogic seems to be clear about that, which is good.
After reading the Java API documentation I realized that it would be trivial to integrate a MarkLogic server with XIRUSS-T, which I did this evening in about an hour: 30 minutes working out how my own code worked, 20 minutes figuring out the sequence of MarkLogic API calls to make (which I did using Jython connected to both a running XIRUSS-T server and a running MarkLogic server), and then about 10 minutes coding up my integration code.
This is pretty impressive to me because it indicates that the MarkLogic system is solid and easy to integrate (at least for the simple thing I did, but I don't see any great potential complexities other than careful error handling in what I've seen so far). It certainly passed the first gates of being easy to get running, easy to figure out how to do something useful with it, and easy to write custom code against its API. A lot of tools don't pass the first gate and even fewer pass the second or third.
My integration with XIRUSS is very simple and not anywhere near as complete as you'd really want in a fully-realized system, but it's sufficient to allow me to throw an XQuery at any document stored in the XIRUSS repository.
To do the integration I wrote a simple XIRUSS StorageManager that is just a wrapper over any other storage manager. This wrapper creates a MarkLogic-specific StorageObjectData instance that is itself a wrapper over any other StorageObjectData implementation (which in turn manages the actual storage and access to the data content of a StorageObject version).
The MarkLogicStorageManager is constructed with the URI of a MarkLogic XDBC server (which provides remote access to a MarkLogic server via a simple Java API) and holds the top-level server access object. It is constructed with a real storage manager and just delegates to it for all the methods except setStorageObjectData(), in which it constructs a new MarkLogicStorageObjectData instance (that in turn wraps the StorageObjectData instance created by the underlying real storage manager). This approach lets you use the MarkLogic storage manager with any particular way of storing the data. The alternative would have been to directly subclass InMemoryStorageManager or FileStorageManager, but since MarkLogic doesn't care where the data is stored in the repository, the wrapper approach seems more appropriate.
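To make the wrapper-over-wrapper arrangement concrete, here is a minimal, self-contained sketch of the pattern. It is not the XIRUSS-T or MarkLogic code: the Data, Manager, and Indexer interfaces, the createData() method, and the "/versions/" URI prefix are all simplified stand-ins I've assumed for illustration, and the "indexer" just records what it is given rather than talking to a server.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

public class WrapperSketch {

    /** Hypothetical stand-in for a StorageObjectData-like type. */
    interface Data {
        OutputStream getOutputStream();
        byte[] getBytes();
        void close() throws IOException;
    }

    /** Hypothetical stand-in for a StorageManager-like type. */
    interface Manager {
        Data createData(String versionId);
    }

    /** Hypothetical stand-in for the search server's insert call. */
    interface Indexer {
        void insert(String uri, byte[] content) throws IOException;
    }

    /** A "real" storage implementation: keeps the bytes in memory. */
    static class InMemoryData implements Data {
        private final ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        public OutputStream getOutputStream() { return buffer; }
        public byte[] getBytes() { return buffer.toByteArray(); }
        public void close() { /* nothing to release */ }
    }

    static class InMemoryManager implements Manager {
        public Data createData(String versionId) { return new InMemoryData(); }
    }

    /** Wraps any Data; on close() it also hands the content to the indexer. */
    static class IndexingData implements Data {
        private final Data delegate;
        private final Indexer indexer;
        private final String uri;

        IndexingData(Data delegate, Indexer indexer, String uri) {
            this.delegate = delegate;
            this.indexer = indexer;
            this.uri = uri;
        }
        public OutputStream getOutputStream() { return delegate.getOutputStream(); }
        public byte[] getBytes() { return delegate.getBytes(); }
        public void close() throws IOException {
            delegate.close();                          // finish the real write first
            indexer.insert(uri, delegate.getBytes());  // then index the content
        }
    }

    /** Wraps any Manager; the only thing it changes is which Data it returns. */
    static class IndexingManager implements Manager {
        private final Manager delegate;
        private final Indexer indexer;

        IndexingManager(Manager delegate, Indexer indexer) {
            this.delegate = delegate;
            this.indexer = indexer;
        }
        public Data createData(String versionId) {
            // "/versions/" is an assumed prefix, not the real constant's value.
            String uri = "/versions/" + versionId;
            return new IndexingData(delegate.createData(versionId), indexer, uri);
        }
    }

    /** Test double: records inserted documents instead of calling a server. */
    static class RecordingIndexer implements Indexer {
        final Map<String, String> documents = new LinkedHashMap<>();
        public void insert(String uri, byte[] content) {
            documents.put(uri, new String(content, StandardCharsets.UTF_8));
        }
    }

    public static void main(String[] args) throws IOException {
        RecordingIndexer indexer = new RecordingIndexer();
        Manager manager = new IndexingManager(new InMemoryManager(), indexer);
        Data data = manager.createData("42");
        data.getOutputStream().write("<doc/>".getBytes(StandardCharsets.UTF_8));
        data.close(); // the indexing side effect happens here
        System.out.println(indexer.documents);
    }
}
```

Because the indexing wrapper only depends on the interfaces, swapping InMemoryManager for a file-backed manager would need no change to the indexing code, which is the point of wrapping rather than subclassing.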
The MarkLogicStorageObjectData class adds to the "close()" method of StorageObjectData (called when you're done writing to a mutable version's content) the logic to get a new MarkLogic session, create a "content" object (which indexes the storage object's content), and insert that content object into the MarkLogic repository, named in a way that maps directly to the Version object as stored in the XIRUSS repository:
    public void close() throws IOException {
        // Finish writing to the underlying (real) storage first.
        this.data.close();
        // Now write the data to the MarkLogic repository.
        StorageObject so = this.getStorageObject();
        String mlUrl = HttpApiUrlConstants.VERSIONS + "/" + so.getId();
        // Pick the MarkLogic content format that matches the storage object type.
        ContentCreateOptions options = null;
        if (so instanceof XmlStorageObject) {
            options = ContentCreateOptions.newXmlInstance();
        } else if (so instanceof TextStorageObject) {
            options = ContentCreateOptions.newTextInstance();
        } else {
            options = ContentCreateOptions.newBinaryInstance();
        }
        logger.debug("Creating MarkLogic content object for version " + so.getId() + ": " + so.getName());
        Content content = ContentFactory.newContent(mlUrl, this.getInputStream(), options);
        logger.debug("Content object created");
        Session mlSession = this.contentSource.newSession();
        try {
            logger.debug("Inserting content into MarkLogic server...");
            mlSession.insertContent(content);
            logger.debug("Content inserted");
        } catch (RequestException e) {
            logger.error(e);
            throw new IOException("Exception putting content into MarkLogic server: " + e.getMessage());
        } finally {
            // Release the session whether or not the insert succeeded.
            mlSession.close();
        }
    }
I then created a subclass of JettyXirussHttpApiRunner that does nothing more than create a new MarkLogicStorageManager and sets it as the default storage manager for the repository:
XirussRepository rep = new XirussRepositoryDefaultImpl();
URI mlURI = new URI("xcc://admin:admin@localhost:8010/Documents");
StorageManager sm = new MarkLogicStorageManager(rep, rep.getDefaultStorageManager(), mlURI);
rep.addStorageManager(sm);
rep.setDefaultStorageManager(sm.getId());
rep.setPort(port);
MarkLogicXirussHttpApiRunner runner = new MarkLogicXirussHttpApiRunner(rep);
runner.start();
That's all there was to it. I then used my little Python XIRUSS client to import an XML document and hey presto, from the MarkLogic sample query UI (a Web page that just lets you submit arbitrary XQueries and see the results) I could query against the document I just imported.
I think this exercise certainly validates the XIRUSS design to some degree (the fact that it was that easy to bind it to MarkLogic by using the defined extension points).
To make this integration more complete I'd want to do several things. One is to reflect XIRUSS-maintained Version properties in the MarkLogic repository as appropriate (MarkLogic has the concept of arbitrary properties associated with indexed documents). Another is to have some association between branch and snapshot visibility in XIRUSS and the equivalent security settings in MarkLogic (i.e., when you query the MarkLogic database you can only see results for Versions that are visible in your current branch and snapshot context). A third is to integrate XIRUSS's schema registry with the MarkLogic schema awareness (needed to do schema-type-aware XQueries and validation through MarkLogic's built-in processing support). There's also some schema-specific configuration of MarkLogic that you may need to do (such as fragmentation points in documents so it can handle large document instances).
But I've certainly proven to myself that the minimal useful integration is not at all hard.
Also, MarkLogic offers both time-limited evaluation versions and a size-limited "community" version that is an ideal companion to XIRUSS-T (as a toy system).
The MarkLogic storage manager code is in the XIRUSS-T Subversion repository on SourceForge. I created it as a separate Eclipse project from the main xiruss-t code base, so it's in trunk/marklogic_storage_manager.
Labels: xiruss marklogic integration "xml search and retrieval"