
NOTE TO TOOL OWNERS: In this blog I will occasionally make statements about products that you will take exception to. My intent is to always be factual and accurate. If I have made a statement that you consider to be incorrect or inaccurate, please bring it to my attention and, once I have verified my error, I will post the appropriate correction.

And before you get too exercised, please read the post, dated 9 Feb 2006, titled "All Tools Suck".

Wednesday, October 11, 2006

Topic Maps, Knowledge, and OpenCyc

The recent comment about XTM (XML Topic Maps) reminded me that I ought to express a thought I've been having about topic maps for a long time.

First, let me say that I've been involved with topic maps from the first moment the name was coined, way back at the CApH meeting in 1992. Our original goal was to define a simple application of HyTime that would put that very abstract and wide-ranging standard into a concrete application context that people could readily understand. The target use case was the generic representation of back-of-the-book indexes and thesauri.

If you think about a back-of-the-book (botb) index, it is nothing more than a set of terms and phrases linked back to the specific content relevant to those terms. It's not a huge leap to go from the idea of a print botb index to a more general collection of terms and links into any data that can be linked to (which with HyTime was any data at all).

At its simplest a botb index is just a flat list of terms with links, with no explicit relationship between the terms. Of course, some groupings represent categorizations rather than just shortcuts: "pastry/pies" represents a kind-of classification (pie is a kind of pastry), while a first-level entry of "pie" with subentries "apple" and "peach" is just a shortcut for the two entries "pie, apple" and "pie, peach". That is, in this second case, "pie" does not classify "apple" and "peach"; it just forms two phrases that happen to both start with "pie".

Another characteristic of indexes is that the same concept will be represented in different ways, e.g., "pie, apple" and "apple pie".

There is also some amount of cross-concept linking via "see" and "see also" links: "pie, see also tarts".

Finally, there may be indications of controlled vocabularies: "crumble: see cobbler", where certain terms are deprecated by refusing to index them directly.

Trying to abstract this a little bit leads to this basic data model for indexes:
  • "concept" or "topic": the abstract thing being indexed. A topic may have any number of names. Topics are objects that have well-defined identity within some bounded scope.
  • "name" or "alias": an arbitrary human-readable label for a topic. For example "apple pie", "pie, apple", "tarte au pomme", etc.
  • "association": a relationship between two topics indicating that they are related in some way, i.e.: is-a, part-of, parent-of, related-to, etc. The set of possible association types is unbounded. In the context of an index, typical relationships would be "is-a" (e.g., pie->apple, pastry->pie), "similar-to" (see-also), etc.
  • "instance": any data object that is linked to from a topic to indicate that the link target is in some way an instance of the concept. For example, a topic for the concept apple pie might link to an apple pie recipe as well as a picture of an apple pie as well as the Wikipedia entry for apple pies
Given this data model it should be pretty easy to see how you could represent the data in a botb index and then generate a traditional index from it: for each topic you would create an index entry for each of its names, sorted appropriately; is-a relationships would imply second-level entries; and so on. You do run into some practical problems, such as deciding which of all the names a topic might have should appear in the index, but that can be handled by attaching application-specific metadata to the names (e.g., national language, use context, etc.). For example, the topic for "apple pie" might have the names "apple pie" and "apple", with the name "apple" flagged as "use for subordinate index entries", allowing you to then construct the entries:
pie
    apple
apple
    pie
But not "apple pie" (which would be redundant with the general entry for "apple").

Thesauri lead to a similar data model.

If you look at the data model, in particular the associations, you start to see that you can easily construct arbitrarily sophisticated systems of relationships among topics. A set of is-a relationships is a taxonomy or ontology (depending on how you define those terms). A set of part-of relationships is an assembly tree or a bill of materials.

At this point, what we have would translate to a syntactically simple (in the sense that there is a relatively small set of element types, required properties, and core semantics) way of representing things like indexes, thesauri, navigation hierarchies, taxonomies, ontologies, and so on. A very useful thing, especially for interchange and interoperation.

Given this simple but powerful model you start to see that it could be usefully applied to the general problem of "metadata management", that is, the definition of metadata schemas (taxonomies, ontologies, what have you) and the association of relevant metadata with specific objects (e.g., documents in a repository, Web pages, data captured through data mining, etc.). In particular, it provides a clear, standard, generic way to unilaterally apply metadata to objects. In addition, by using queries to link from topics to their instances, you can bind things based on their inherent metadata (e.g., if I have a database of recipes that are already tagged by type of dish and ingredients, I can use a query to link from my apple pie topic to any recipe instance of dish type "pie" and main ingredient "apple").
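
As a rough illustration of that last point, here is a sketch in Java of what query-based binding might look like. The Recipe type, the database, and the "occurrence query" are all made up for the example; a real system would query a real repository.

import java.util.List;
import java.util.stream.Collectors;

// A hypothetical sketch of query-based topic-to-instance binding.
// The Recipe type and the query mechanism are invented for illustration only.
public class QueryOccurrenceSketch {

    record Recipe(String title, String dishType, String mainIngredient) {}

    // Instead of pointing at specific recipes, the topic declares the
    // metadata its instances must carry.
    record OccurrenceQuery(String dishType, String mainIngredient) {
        List<Recipe> resolve(List<Recipe> database) {
            return database.stream()
                    .filter(r -> r.dishType().equals(dishType())
                              && r.mainIngredient().equals(mainIngredient()))
                    .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) {
        List<Recipe> database = List.of(
                new Recipe("Grandma's Apple Pie", "pie", "apple"),
                new Recipe("Peach Pie", "pie", "peach"),
                new Recipe("Apple Crumble", "crumble", "apple"));

        // The "apple pie" topic carries a query, not a hard-coded list of links.
        OccurrenceQuery applePieInstances = new OccurrenceQuery("pie", "apple");
        applePieInstances.resolve(database)
                .forEach(r -> System.out.println("instance: " + r.title()));
    }
}

The point is that the topic never names its instances directly; it just declares the metadata its instances must have.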

This allows you the possibility of layering any number of descriptive metadata sets over existing data sets. A very useful thing to be able to do.

From there you start to think that you could represent knowledge as a set of topics and associations.

This is where I think things have gone wrong in the topic map world. I realize that this is not necessarily a popular or welcome opinion: the topic map community has tried to bill itself as one of the primary players in the knowledge management domain (along with RDF and related approaches, such as OWL and whatnot). Michael Sperberg-McQueen gave an excellent closing keynote [I could stop there--it's always true] at one of the Extreme Markup conferences where he provided a hilarious comparison of RDF and Topic Maps and made it pretty clear that both were just different views of the same space and that neither was complete nor could it be. [See also Goedel's Incompleteness Theorems.] So consider this a minority dissenting voice.

And let me be clear: I think that topic maps are useful and attractive as far as they go: for the general business problem of managing metadata and associating it with data objects, it's well suited and well thought out.

Why do I think that topic maps (and anything similar, such as RDF) are not suitable for knowledge representation?

For the simple reason that knowledge representation is much more sophisticated and subtle than just topics with associations. That is, having topics with associations and a processor that can examine those is necessary but not sufficient for enabling true knowledge representation and true knowledge-based processing (that is, automatic processes that can do useful things with that knowledge, such as reliably categorize and index medical journal articles or make sense out of a vast pool of intercepted emails or analyze financial information to find market trends).

That is, "knowledge management", to be truly useful, has to start doing things that heretofore only humans could do.

I came to this understanding when I started trying to use the OpenCyc system to do reasoning on topic maps.

The Cyc system is the brainchild of Doug Lenat, who had the idea that the only way to create a true artificial intelligence was to build up a massive database of "common sense", that is, facts about everything in the world. The hypothesis was that given a rich enough body of such facts and an appropriate reasoning engine, the system would be able to do useful, reliable, and unique reasoning about anything, not just the narrow domains to which expert systems had been applied at the time (this was in the mid-to-late 80's). Doug figured it would take about 10 years to build up the initial database of facts and set about doing it by hiring people from pretty much any and all domains to start putting in facts and assertions. After about 10 years they had their first success and went from there.

[Historical note: Steve Newcomb invited Doug to give the keynote at the first HyTime conference--it was electrifying because Doug had assumed there would be conflict between the hypertext people and the expert systems people, but he was pleasantly surprised to discover that in fact we saw a powerful potential synergy in connecting authored links to the power of something like Cyc to do automatic linking to existing, undifferentiated data. We all had a wonderful dinner with Doug that night--I've been a Cyc watcher and fan ever since. It was at an Extreme Markup convention (I think) when Cycorp announced OpenCyc--we were all very pleased at that announcement.]

Anyway, for a brief moment I had a little extra time and some business need to play around with topic maps (my work assignments have never involved topic maps). I wanted to see what I would get if I applied Cyc's common-sense reasoning ability to an arbitrary topic map and what it would take to marry the two. I was going on the hypothesis that there would be a reasonably direct mapping from the data in a topic map into however Cyc holds its data. Certainly there would be a way to represent topics as objects and there should be a way to represent associations. By then associating specific topics with existing concepts in the Cyc database it should be possible to either have Cyc reason about the topic map based on what it already knows or extend its knowledge with the facts in the topic map and then apply its reasoning engine to those.

Cyc has an XML representation format for its data as well as a Python API, both of which made getting the topic maps into Cyc easy enough. However, at the time I tried this I was hampered by the limitations of the OpenCyc database, which reflected only a fraction of the total Cyc database available in the commercial product. Doh! However, I notice that OpenCyc 1.0 claims to include the entire Cyc database. That would make a big difference.

But more importantly, I quickly realized that the way Cyc represents the world is much much more sophisticated than a simple set of topics and associations. Here is a quote from the Cycorp Web site that explains the basic Cyc model:
The Cyc KB is divided into many (currently thousands of) "microtheories", each of which is essentially a bundle of assertions that share a common set of assumptions; some microtheories are focused on a particular domain of knowledge, a particular level of detail, a particular interval in time, etc. The microtheory mechanism allows Cyc to independently maintain assertions which are prima facie contradictory, and enhances the performance of the Cyc system by focusing the inferencing process.
This notion of "microtheories" and the ability to organize them in various ways reflects a degree of sophistication that goes far beyond what you get with topic maps alone. Add to that the sophistication of the reasoning heuristics, the way the rules have been crafted, and myriad other details of how the concepts and assertions are represented, and you quickly start to realize that there is a lot more to "knowledge" representation than topics and associations. You also quickly realize that the mechanisms defined by the topic map specifications alone are nowhere near enough to represent knowledge in a way that enables non-trivial automatic reasoning.

At a minimum, there's a whole other layer of semantics and descriptive metadata that has to be added to the information in a topic map to make it approach the completeness of Cyc's knowledgebase. For topic maps to be useful for true knowledge representation these semantics and metadata would have to be defined and standardized, which is of course possible, but much much harder to do than standardizing the base topic map syntax itself (which itself took over 10 years--pretty remarkable considering how simple it appears to be at first glance).

Thus my conclusion that topic maps, by themselves, do not in any really meaningful way "capture knowledge". They can at best provide identifying objects for concepts, express simple facts about those concepts in relation to each other, and bind those facts to instances of the concepts. But that's it. This is information. Very useful information and a sophisticated way to capture it, but it is not knowledge.

You could of course argue that what's in Cyc is not really knowledge either, but you cannot deny that whatever is in Cyc, it's much closer to being knowledge than a topic map can be.

But I still think it would be a useful experiment to see what you get if you try to apply Cyc to arbitrary topic maps. If OpenCyc's knowledgebase is really complete then this could be quite fruitful.

One key challenge is binding the topics in the input topic map to the correct concept or microtheory in the Cyc knowledgebase. This gets you to the fundamental problem of subject identification, which is something topic maps try to address through the notion of subject identifiers. An interesting question for Cyc to try to answer would be "given two topics, are these topics about the same subject?".

One of the subtleties of Cyc that really got me to realize how involved the subject is was the question of vampires. If your domain is "the real world" then of course you know that vampires (that is, undead humans who drink the blood of the living) don't exist (except that some people must believe they do...hmmm). But if you are in the domain of literature then of course vampires do exist, because there are endless books that feature vampires as characters. So clearly any system that hopes to be able to model everything has to be able to hold at once the fact that vampires don't exist and that they do, and keep the contexts in which those statements are and are not true clearly distinct in a way that still allows them to be used together ("Bela Lugosi, a real human, played an (imaginary) vampire in motion pictures" or "the vampires in Bram Stoker's Dracula follow very different rules from the vampires in Anne Rice's vampire books"). Clearly it's not sufficient to have a single subject "vampire"; you need multiple related subjects in different knowledge contexts.

In any case, I found it a humbling revelation and turned my attention to the more concrete challenges of automated composition and technical document authoring and management, content to leave knowledge representation to the experts.

But that's just me....


Wednesday, October 04, 2006

PDF Processing Fun

The company I work for does data conversion as a large part of its business. Some of this involves extracting content from PDF documents to turn it into XML in whatever schema the customer specifies.

This is a difficult problem in the general case and, in some pathological cases, an impossible one. This is because PDF, like PostScript on which it is based, is a graphic command language optimized for drawing glyphs (graphical representations of characters) in a sequence of two-dimensional spaces called pages.

For example, there's no requirement that any two characters have any particular relationship to each other in the PDF data stream--their visual relationship is entirely a function of their (possibly coincidental) placement on the page next to each other. Two glyphs that are adjacent on the rendered page need not be adjacent in the PDF data stream.

For the most part PDF documents are not quite this pathological but they still aren't necessarily easy to process. For example, in nicely typeset documents a run of text can be expressed in the PDF as a sequence of characters and positioning values representing the kerning between them, something like this:
[(a)4(b)-35(c)-640(def)] TJ

Rendered, this will be "abc def", where the numbers nudge the glyphs closer together or farther apart (they're in thousandths of a text-space unit; negative values add space), leaving a lot of space between the "c" and "d" (but no literal space character).

An obvious challenge is determining whether or not the gap between the "c" and "d" should be captured as a literal space--that turns out to be a big problem.
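
One common-sense heuristic (this is my own sketch, not how any particular tool actually does it) is to compare the gap against some fraction of the space width for the current font and insert a literal space only when the gap is wide enough:

// A hypothetical heuristic for deciding whether the gap left by a TJ
// adjustment should become a literal space in the extracted text.
// The 0.5 threshold is an assumption, not taken from any specification.
public class SpaceHeuristicSketch {

    /**
     * @param adjustment the TJ adjustment between two glyphs, in thousandths
     *                   of a text-space unit (negative values push glyphs apart)
     * @param spaceWidth the width of the font's space glyph, in the same units
     * @return true if the gap is wide enough to treat as a word break
     */
    static boolean isWordBreak(double adjustment, double spaceWidth) {
        double gap = -adjustment;          // negative adjustment == extra gap
        return gap > 0.5 * spaceWidth;     // arbitrary, tunable threshold
    }

    public static void main(String[] args) {
        double spaceWidth = 250;           // a typical space width, in thousandths
        System.out.println(isWordBreak(-35, spaceWidth));    // false: just kerning
        System.out.println(isWordBreak(-640, spaceWidth));   // true: a word break
    }
}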

Another basic problem is paragraph detection. In the PDF, each line of text is specified as one or more sequences like the one shown above. There's nothing in the PDF that explicitly tells you that the lines form a larger logical construct. But of course for creating marked-up documents you really need to know what the paragraphs are.

This turns out to be a really hard problem. And then there's stuff like table recognition, dehyphenating words at the ends of lines, handling sentences and paragraphs that break across pages, isolating things like headers and footers, and so on, all of which make the problem challenging.
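
To give a flavor of why it's hard, here's a toy sketch of one obvious heuristic: start a new paragraph when the vertical gap between lines is bigger than the normal leading, or when a line is indented. The Line record and the thresholds are invented for illustration; real pages (multiple columns, tables, footnotes) defeat anything this simple almost immediately.

import java.util.ArrayList;
import java.util.List;

// A hypothetical sketch of naive paragraph detection from positioned lines.
public class ParagraphSketch {

    record Line(double x, double y, String text) {}   // y decreases down the page

    static List<String> toParagraphs(List<Line> lines, double leading, double indent) {
        List<String> paragraphs = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        Line previous = null;
        for (Line line : lines) {
            boolean newParagraph = previous == null
                    || (previous.y() - line.y()) > 1.5 * leading   // unusually large gap
                    || (line.x() - previous.x()) > indent;         // first-line indent
            if (newParagraph && current.length() > 0) {
                paragraphs.add(current.toString().trim());
                current.setLength(0);
            }
            current.append(line.text()).append(' ');
            previous = line;
        }
        if (current.length() > 0) paragraphs.add(current.toString().trim());
        return paragraphs;
    }

    public static void main(String[] args) {
        List<Line> lines = List.of(
                new Line(72, 700, "The first paragraph starts here and"),
                new Line(72, 688, "continues on a second line."),
                new Line(90, 664, "A new, indented paragraph follows after"),
                new Line(72, 652, "a larger vertical gap."));
        toParagraphs(lines, 12, 10).forEach(System.out::println);
    }
}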

We've been experimenting with some commercial tools, which I can't really mention, that do a lot of this--there are obviously some clever folks at these companies. But they tend to charge a pretty hefty price for their software--I think it's commensurate with the value they provide, but it still places it out of the reach of a lot of casual users.

For my own work, I needed a quick way to just get at the raw PDF data within a document, partly so I could see why the tools we were using were not always giving the answers we wanted--and whether that was a bug or just pathological PDF data.

So I poked around for an open-source, Java-based PDF library and found PDFBox (http://pdfbox.apache.org), which appears to be a very complete library for reading and writing PDF, including things like providing the X/Y location of constructs (a challenge because it requires essentially rendering the data just as Acrobat Reader would--PDF allows quite complex chains of calculations that all contribute to the final location and orientation of any given graphical element). I haven't had a chance to really push on it, but I did read over the docs, the API, and some of the sample apps they provide, and it looks pretty promising.
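
As a taste of the API, here's a minimal sketch of dumping each glyph with its position. This assumes the PDFBox 2.x class names and a made-up file name; check the docs for the exact packages in whatever version you're using.

import java.io.File;
import java.io.IOException;
import java.util.List;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;
import org.apache.pdfbox.text.TextPosition;

// Print each glyph with its adjusted X/Y position by overriding
// PDFTextStripper.writeString() (PDFBox 2.x API; "sample.pdf" is made up).
public class PdfBoxPeek {

    public static void main(String[] args) throws IOException {
        try (PDDocument document = PDDocument.load(new File("sample.pdf"))) {
            PDFTextStripper stripper = new PDFTextStripper() {
                @Override
                protected void writeString(String text, List<TextPosition> positions)
                        throws IOException {
                    for (TextPosition p : positions) {
                        System.out.printf("%s at (%.1f, %.1f), font size %.1f%n",
                                p.getUnicode(), p.getXDirAdj(), p.getYDirAdj(),
                                p.getFontSizeInPt());
                    }
                }
            };
            stripper.setSortByPosition(true);   // order glyphs by page position
            stripper.getText(document);         // drives the callback above
        }
    }
}

From there it's a short step to feeding those positions into heuristics like the ones sketched above.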

PDFBox doesn't do everything the commercial tools do--it doesn't do the sort of synthesis of higher-level constructs that they do (paragraph recognition, dehyphenation, etc.), which is where the core value of the commercial tools is (these are hard problems)--but it looks like it provides enough raw functionality to let you develop those features yourself to one degree or another. The problems are challenging, but they're also interesting puzzles.

Anyway, Dr. Macro says check out PDFBox.

Update 2015: PDFBox is still going strong and continues to improve. I've been using it for various things all these years and still swear by it.

I'm particularly pleased to see PDFBox because several years ago, after a very painful experience with a particularly poorly-implemented Java PDF library (which shall remain nameless), I started implementing my own but only got as far as reading and writing pages, not page components. The project (PDF4J on SourceForge) sat idle until recently, when I deactivated it in the face of PDFBox--I don't mind having PDF4J obsoleted; far from it, I'm happy to see that someone did what I didn't have time or energy to do.
