Dr. Macro's XML Rants

NOTE TO TOOL OWNERS: In this blog I will occasionally make statements about products that you will take exception to. My intent is to always be factual and accurate. If I have made a statement that you consider to be incorrect or inaccurate, please bring it to my attention and, once I have verified my error, I will post the appropriate correction.

And before you get too exercised, please read the post, dated 9 Feb 2006, titled "All Tools Suck".

Wednesday, October 11, 2006

Topic Maps, Knowledge, and OpenCyc

The recent comment about XTM (XML Topic Maps) reminded me that I ought to express a thought I've been having about topic maps for a long time.

First, let me say that I've been involved with topic maps from the first moment the name was coined, way back at the CApH meeting in 1992. Our original goal was to define a simple application of HyTime that would put that very abstract and wide-ranging standard into a concrete application context that people could readily understand. The target use case was the generic representation of back-of-the-book indexes and thesauri.

If you think about a back-of-the-book (botb) index, it is nothing more than a set of terms and phrases linked back to the specific content relevant to those terms. It's not a huge leap to go from the idea of a print botb index to a more general collection of terms and links into any data that can be linked to (which with HyTime was any data at all).

At its simplest a botb index is just a flat list of terms with links, with no explicit relationship between the terms. Of course, some groupings represent categorizations rather than just shortcuts: a first-level entry of "pastry" with "pie" under it represents a kind-of classification (pie is a kind of pastry), while a first-level entry of "pie" with "apple" and "peach" under it is just a shortcut for the two entries "pie, apple" and "pie, peach". That is, in this second case, "pie" does not classify "apple" and "peach" but just forms two phrases that happen to both start with "pie".

Another characteristic of indexes is that the same concept will be represented in different ways, e.g., "pie, apple" and "apple pie".

There is also some amount of cross-concept linking via "see" and "see also" links: "pie, see also tarts".

Finally, there may be indications of controlled vocabularies: "crumble: see cobbler", where certain terms are deprecated by refusing to index them directly.

Trying to abstract this a little bit leads to this basic data model for indexes:
  • "concept" or "topic": the abstract thing being indexed. A topic may have any number of names. Topics are objects that have well-defined identity within some bounded scope.
  • "name" or "alias": an arbitrary human-readable label for a topic. For example "apple pie", "pie, apple", "tarte au pomme", etc.
  • "association": a relationship between two topics indicating that they are related in some way, i.e.: is-a, part-of, parent-of, related-to, etc. The set of possible association types is unbounded. In the context of an index, typical relationships would be "is-a" (e.g., pie->apple, pastry->pie), "similar-to" (see-also), etc.
  • "instance": any data object that is linked to from a topic to indicate that the link target is in some way an instance of the concept. For example, a topic for the concept apple pie might link to an apple pie recipe as well as a picture of an apple pie as well as the Wikipedia entry for apple pies
Given this data model it should be pretty easy to see how you could represent the data in a botb index and then generate a traditional index from it (for example, for each topic you would create an index entry for each of its names, sorted appropriately; is-a relationships would imply second-level entries; and so on). You start to run into some practical problems, such as deciding which of all the names a topic might have to use in the index, but those can be handled by having application-specific metadata for the names (e.g., national language, use context, etc.). For example, the topic for "apple pie" might have the names "apple pie" and "apple", with the name "apple" flagged as "use for subordinate index entries", allowing you to then construct the entries:
pie
    apple
apple
    pie
But not "apple pie" (which would be redundant with the general entry for "apple").

Thesauri lead to a similar data model.

If you look at the data model, in particular the associations, you start to see that you can easily construct arbitrarily sophisticated systems of relationships among topics. A set of is-a relationships is a taxonomy or ontology (depending on how you define those terms). A set of part-of relationships is an assembly tree or a bill of materials.
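
As a small illustration (reusing the Association class from the sketch above), the same flat set of associations can be turned into a nested hierarchy just by picking the association type you care about:

def build_hierarchy(associations, assoc_type):
    """Turn flat typed associations into a nested tree: 'is-a' pairs give a
    taxonomy, 'part-of' pairs give an assembly tree or bill of materials.
    Convention (mine): from_topic <assoc_type> to_topic, e.g. pie is-a pastry."""
    children = {}
    all_from, all_to = set(), set()
    for a in associations:
        if a.type != assoc_type:
            continue
        children.setdefault(a.to_topic, []).append(a.from_topic)
        all_from.add(a.from_topic)
        all_to.add(a.to_topic)

    def subtree(node):
        return {child: subtree(child) for child in children.get(node, [])}

    roots = all_to - all_from   # topics that never appear on the more-specific side
    return {root: subtree(root) for root in roots}

print(build_hierarchy([Association("is-a", "pie", "pastry"),
                       Association("is-a", "apple-pie", "pie")], "is-a"))
# {'pastry': {'pie': {'apple-pie': {}}}}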

At this point, what we have would translate to a syntactically simple (in the sense that there is a relatively small set of element types, required properties, and core semantics) way of representing things like indexes, thesauri, navigation hierarchies, taxonomies, ontologies, and so on. A very useful thing, especially for interchange and interoperation.

Given this simple but powerful model you start to see that it could be usefully applied to the general problem of "metadata management", that is, the definition of metadata schemas (taxonomies, ontologies, what have you) and the association of relevant metadata to specific objects (e.g., documents in a repository, Web pages, data captured through data mining, etc.). In particular, it provides a clear, standard, generic way to unilaterally apply metadata to objects. In addition, by using queries to link from topics to their instances, you can bind things based on their inherent metadata (e.g., if I have a database of recipes that are already tagged by type of dish and ingredients, I can use a query to link from my apple pie topic to any recipe instance of dish type "pie" and main ingredient "apple").
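
A toy version of that query-based binding, again reusing the classes from the earlier sketch; the recipe records and their field names are made up purely for illustration:

recipes = [
    {"id": "r1", "title": "Apple Pie",     "dish_type": "pie",     "main_ingredient": "apple"},
    {"id": "r2", "title": "Peach Pie",     "dish_type": "pie",     "main_ingredient": "peach"},
    {"id": "r3", "title": "Apple Crumble", "dish_type": "crumble", "main_ingredient": "apple"},
]

def bind_instances_by_query(topic, records, **criteria):
    """Attach every record matching the criteria as an instance of the topic,
    instead of hard-coding links to individual objects."""
    for rec in records:
        if all(rec.get(k) == v for k, v in criteria.items()):
            topic.instances.append(rec["id"])
    return topic

apple_pie_topic = Topic("apple-pie", [Name("apple pie")])
bind_instances_by_query(apple_pie_topic, recipes, dish_type="pie", main_ingredient="apple")
print(apple_pie_topic.instances)   # ['r1']

The point is that the topic-to-instance links live as a query over the data's own metadata, so the binding stays current as the underlying database changes.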

This allows you the possibility of layering any number of descriptive metadata sets over existing data sets. A very useful thing to be able to do.

From there you start to think that you could represent knowledge itself as a set of topics and associations.

This is where I think things have gone wrong in the topic map world. I realize that this is not necessarily a popular or welcome opinion: the topic map community has tried to bill itself as one of the primary players in the knowledge management domain (along with RDF and related approaches, such as OWL and whatnot). Michael Sperberg-McQueen gave an excellent closing keynote [I could stop there--it's always true] at one of the Extreme Markup conferences where he provided a hilarious comparison of RDF and Topic Maps and made it pretty clear that both were just different views of the same space and that neither was complete nor could it be. [See also Goedel's Incompleteness Theorems.] So consider this a minority dissenting voice.

And let me be clear: I think that topic maps are useful and attractive as far as they go: for the general business problem of managing metadata and associating it with data objects, they are well suited and well thought out.

Why do I think that topic maps (and anything similar, such as RDF) are not suitable for knowledge representation?

For the simple reason that knowledge representation is much more sophisticated and subtle than just topics with associations. That is, having topics with associations and a processor that can examine those is necessary but not sufficient for enabling true knowledge representation and true knowledge-based processing (that is, automatic processes that can do useful things with that knowledge, such as reliably categorize and index medical journal articles or make sense out of a vast pool of intercepted emails or analyze financial information to find market trends).

That is, "knowledge management", to be truly useful, has to start doing things that heretofore only humans could do.

I came to this understanding when I started trying to use the OpenCyc system to do reasoning on topic maps.

The Cyc system is the brainchild of Doug Lenat, who had the idea that the only way to create a true artificial intelligence was to build up a massive database of "common sense", that is, facts about everything in the world. The hypothesis was that given a rich enough body of such facts and an appropriate reasoning engine, the system would be able to do useful, reliable, and unique reasoning about anything, not just the narrow domains to which expert systems had been applied at the time (this was in the mid-to-late 80's). Doug figured it would take about 10 years to build up the initial database of facts and set about doing it by hiring people from pretty much any and all domains to start putting in facts and assertions. After about 10 years they had their first success and went from there.

[Historical note: Steve Newcomb invited Doug to give the keynote at the first HyTime conference--it was electrifying because Doug had assumed there would be conflict between the hypertext people and the expert systems people, but he was pleasantly surprised to discover that in fact we saw a powerful potential synergy in connecting authored links to the power of something like Cyc to do automatic linking to existing, undifferentiated data. We all had a wonderful dinner with Doug that night--I've been a Cyc watcher and fan ever since. It was at an Extreme Markup convention (I think) when Cycorp announced OpenCyc--we were all very pleased at that announcement.]

Anyway, for a brief moment I had a little extra time and some business need to play around with topic maps (my work assignments have never involved topic maps). I wanted to see what I would get if I applied Cyc's common-sense reasoning ability to an arbitrary topic map and what it would take to marry the two. I was going on the hypothesis that there would be a reasonably direct mapping from the data in a topic map into however Cyc holds its data. Certainly there would be a way to represent topics as objects and there should be a way to represent associations. By then associating specific topics with existing concepts in the Cyc database it should be possible to either have Cyc reason about the topic map based on what it already knows or extend its knowledge with the facts in the topic map and then apply its reasoning engine to those.
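
To give a flavor of what the most naive version of that mapping might look like, here is a purely illustrative sketch that just emits CycL-style assertion strings from the associations (Association as in the earlier sketch); it does not touch any real Cyc API, and the Cyc constant names in the example are guesses on my part (#$genls and #$isa are the standard CycL ways of saying kind-of and instance-of):

def associations_to_cycl(associations, cyc_constant_for):
    """Emit CycL-style assertion strings (plain text only, no Cyc API involved).

    cyc_constant_for maps a topic id to the name of a Cyc constant; building
    that mapping is the subject-identification problem discussed further down.
    Only two association types are handled: 'is-a' (kind-of) becomes #$genls
    and 'instance-of' becomes #$isa; everything else is skipped.
    """
    assertions = []
    for a in associations:
        frm, to = cyc_constant_for(a.from_topic), cyc_constant_for(a.to_topic)
        if frm is None or to is None:
            continue   # no known Cyc concept for this topic
        if a.type == "is-a":
            assertions.append(f"(#$genls {frm} {to})")
        elif a.type == "instance-of":
            assertions.append(f"(#$isa {frm} {to})")
    return assertions

constants = {"apple-pie": "#$ApplePie", "pie": "#$Pie"}   # hand-built; the constant names are guesses
print(associations_to_cycl([Association("is-a", "apple-pie", "pie")], constants.get))
# ['(#$genls #$ApplePie #$Pie)']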

Cyc has both an XML representation format for its data and a Python API, both of which made getting the topic maps into Cyc easy enough. However, at the time I tried this I was limited by the limitations of the OpenCyc database, which reflected only a fraction of the total Cyc database available in the commercial product. Doh! However, I notice that OpenCyc 1.0 claims to include the entire Cyc database. That would make a big difference.

But more importantly, I quickly realized that the way Cyc represents the world is much much more sophisticated than a simple set of topics and associations. Here is a quote from the Cycorp Web site that explains the basic Cyc model:
The Cyc KB is divided into many (currently thousands of) "microtheories", each of which is essentially a bundle of assertions that share a common set of assumptions; some microtheories are focused on a particular domain of knowledge, a particular level of detail, a particular interval in time, etc. The microtheory mechanism allows Cyc to independently maintain assertions which are prima facie contradictory, and enhances the performance of the Cyc system by focusing the inferencing process.
This notion of "microtheories" and the ability to organize them in various ways reflects a degree of sophistication that goes far beyond what you get with topic maps alone. Add to that the sophistication of the reasoning heuristics and the way that the rules have been crafted and meriod other details of how the concepts and assertions are represented and you quickly start to realize that there is a lot more to "knowledge" representation than topics and associations. You also quickly realize that the mechanisms defined by the topic map specifications alone are nowwhere near enough to represent knowledge in a way that enables non-trivial automatic reasoning.

At a minimum, there's a whole other layer of semantics and descriptive metadata that has to be added to the information in a topic map to make it approach the completeness of Cyc's knowledgebase. For topic maps to be useful for true knowledge representation these semantics and metadata would have to be defined and standardized, which is of course possible, but much much harder to do than standardizing the base topic map syntax itself (which itself took over 10 years, which is pretty remarkable considering how simple it appears to be at first glance).

Thus my conclusion that topic maps, by themselves, do not in any really meaningful way "capture knowledge". They can at best provide identifying objects for concepts, express simple facts about those concepts in relation to each other, and bind those facts to instances of the concepts. But that's it. This is information. Very useful information and a sophisticated way to capture it, but it is not knowledge.

You could of course argue that what's in Cyc is not really knowledge either, but you cannot deny that whatever is in Cyc, it's much closer to being knowledge than a topic map can be.

But I still think it would be a useful experiment to see what you get if you try to apply Cyc to arbitrary topic maps. If OpenCyc's knowledgebase is really complete then this could be quite fruitful.

One key challenge is binding the topics in the input topic map to the correct concept or microtheory in the Cyc knowledgebase. This gets you to the fundamental problem of subject identification, which is something topic maps try to address through the notion of subject identifiers. An interesting question for Cyc to try to answer would be "given two topics, are these topics about the same subject?".
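
As far as the topic map standards themselves go, the answer is roughly "two topics are about the same subject if they share a subject identifier", which is easy to check but begs the hard question of assigning those identifiers in the first place. A trivial sketch (the identifier URIs are just illustrative):

def same_subject(identifiers_a, identifiers_b):
    """Two topics are taken to be about the same subject if they share at least
    one subject identifier (a URI standing for the subject). Deciding that two
    topics with no shared identifier are nonetheless about the same subject is
    exactly where something like Cyc would have to come in."""
    return bool(set(identifiers_a) & set(identifiers_b))

print(same_subject({"http://en.wikipedia.org/wiki/Apple_pie"},
                   {"http://en.wikipedia.org/wiki/Apple_pie", "http://example.org/id/tarte-au-pomme"}))   # True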

One of the subtleties of Cyc that really made me realize how involved the subject is was the question of vampires. If your domain is "the real world" then of course you know that vampires (that is, undead humans who drink the blood of the living) don't exist (except that some people must believe they do...hmmm). But if you are in the domain of literature then of course vampires do exist, because there are endless books that feature vampires as characters. So clearly any system that hopes to be able to model everything has to be able to hold at once the fact that vampires don't exist and that they do, and keep the contexts in which those statements are and are not true clearly distinct in a way that still allows them to be used together ("Bela Lugosi, a real human, played an (imaginary) vampire in motion pictures." or "the vampires in Bram Stoker's Dracula follow very different rules from the vampires in Anne Rice's vampire books."). Clearly it's not sufficient to have a single subject "vampire"; you need multiple related subjects in different knowledge contexts.
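
A toy illustration of why context-scoped assertions are needed here; this is nothing like Cyc's actual machinery, and the microtheory and constant names are all made up:

from collections import defaultdict

class ContextualFactStore:
    """Holds assertions per context ('microtheory'), so prima facie
    contradictory facts can coexist without colliding."""
    def __init__(self):
        self.facts = defaultdict(set)   # context name -> set of assertions

    def assert_fact(self, context, fact):
        self.facts[context].add(fact)

    def ask(self, context, fact):
        return fact in self.facts[context]

kb = ContextualFactStore()
kb.assert_fact("RealWorldMt", ("Vampire", "exists", False))
kb.assert_fact("GothicFictionMt", ("Vampire", "exists", True))
kb.assert_fact("GothicFictionMt", ("Dracula", "is-a", "Vampire"))
kb.assert_fact("RealWorldMt", ("BelaLugosi", "is-a", "Human"))
kb.assert_fact("FilmPortrayalMt", ("BelaLugosi", "portrays", "Dracula"))

print(kb.ask("RealWorldMt", ("Vampire", "exists", True)))       # False
print(kb.ask("GothicFictionMt", ("Vampire", "exists", True)))   # True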

In any case, I found it a humbling revelation and turned my attention to the more concrete challenges of automated composition and technical document authoring and management, content to leave knowledge representation to the experts.

But that's just me....


5 Comments:

Blogger Alex said...

"Thus my conclusion that topic maps, by themselves, do not in any really meaningful way 'capture knowlege'."

Hmm, Topic Maps isn't supposed to be the knowledge itself; it's a container and a data model from which your smart systems can pull out that represented knowledge. As you know, using Cyc as an ontology merged with the Topic Maps that use it enables your software to ask rather intelligent questions of the merged map. That's its purpose, really.

Also, I get the feeling you're talking about the TMDM (Topic Maps Data Model) in most of this, but you should be aware of the TMRM (Topic Maps Reference Model), in which you can create richer, different, better models through rules-based constraints. I moved to the TMRM about a year ago and never looked back, and I think I'm doing knowledge representation. :)

7:18 PM  
Blogger Eliot Kimber said...

Note that I'm focusing on the standardization aspect: the, at least implied, claim is that just having your data in a topic map makes it useful for automatic reasoning because it's a topic map.

That simply can't be the case.

Of course, that's not to say that you couldn't augment a topic map with your own association types, metadata, and so on, to make the information as rich as what is in the Cyc knowledgebase. Of course you could, but anything like that you did today would not be a standard.

That is, just having topic maps is not enough.

I also take the existence of a general claim in the topic map community that topic maps, by their nature, capture "knowledge" as evidence that the topic map community as a whole does not understand knowledge representation well enough to be the likely standardizers of any true knowledge representation mechanism.

Just saying.

12:39 PM  
Blogger Lars Marius Garshol said...

I think your use of the word "knowledge" is problematic here, mainly because everybody uses it differently. There are lots of applications that I think deserve to be called knowledge management (and that people will refer to as KM no matter what you or I think) that Topic Maps handle very well.

However, you are definitely right that to build something like Cyc on top of Topic Maps is something that the ISO standards are not even close to.

As for inferencing, that is definitely possible in Topic Maps and is also being done, but again, what is done uses tolog (or integration with external rule engines), and so is not standardized. It is also not close to Cyc in sophistication.

As for RDF I think you are right that the same applies there, except that the RDF people do much more reasoning/inference and emphasize it more heavily. Interestingly, I would say that this takes them further away from knowledge management in the "soft" sense where one is representing knowledge for transmission between humans, although it does move them closer to automated decision-making. (How close is another question.)

6:44 AM  
Blogger Eliot Kimber said...

Yes of course the term "knowledge" is itself fuzzy--I did try to qualify it but that only takes you so far (a fundamental problem in information and knowledge representation, of course :-)).

But as you say, whatever is done with Topic Maps (or RDF) that can be reasonably considered as knowledge management is not standardized. That's my point: Topic Maps, by themselves, do not standardize knowledge management in any meaningful way.

10:05 AM  
Anonymous Anonymous said...

I have my own experience in trying to integrate OpenCyc and Topic Maps, and my conclusion is a little bit different. Topic Maps are great for representing and exchanging factual knowledge. We can easily take parts of the Cyc ontology and use it for defining factual assertions using topic maps. Cyc's microtheories are handled nicely by topic map scope. Actually, scopes are closer to the powerful ideas described in Doug Lenat's article about contexts. Topic map scope also allows implementing context/microtheory lifting. Facts gathered by a topic map can easily be uploaded to OpenCyc and used for inference/query processing. OpenCyc can be used as a "smart" agent available for knowledge processing on a semantic grid.

3:49 PM  
