PDF Processing Fun
This is, in the general case, a difficult, if not impossible problem in some pathological cases. This is because PDF, like PostScript on which it is based, is a graphic command language optimized for drawing glyphs (graphical representations of characters) in a sequence of two-dimensional spaces called pages.
For example, there's no requirement that any two characters have any particular relationship to each other in the PDF data stream--their visual relationship is entirely a function of their (possibly coincidental) placement on the page next to each other. Two glyphs that are adjacent on the rendered page need not be adjacent in the PDF data stream.
For the most part PDF documents are not quite this pathological but they still aren't necessarily easy to process. For example, in nicely typeset documents a sequence of characters can be expressed in the PDF as a sequence of characters and positioning values, representing the kerning between characters, something like this:
Rendered this will be "abc def" where the characters are closer or nearer, with a lot of space between the "c" and "d" (but no literal space character).
An obvious challenge is determining whether or not the space between the "c" and "d" should be captured as a literal space or not--that turns out to be a big problem.
Another basic problem is paragraph detection. In the PDF, each line of text is specified as one or sequences like that shown above. There's nothing in the PDF that explicitly tells you that the lines form a larger logical construct. But of course for creating marked up documents you really need to know what are the paragraphs.
This turns out to be really hard problem. And there's there's stuff like table recognition, dehyphenating words at the ends of lines, handling sentences and paragraphs that break across pages, isolating things like headers and footers, and so on, that make the problem challenging.
We've been experimenting with some commercial tools, which I can't really mention, that do a lot of this--there's obviously some clever folks at these companies. But they tend to charge a pretty hefty price for their software--I think it's comensurate with the value they provide but it still places it out of the reach of a lot of casual users.
For my own work, I needed a quick way to just get at the raw PDF data within a document, partly so I could see why the tools we are using were not always giving the answer we wanted in order to see if it was a bug or just pathological PDF data.
So I poked around for an open-source, Java-based PDF library and found PDFBox (http://pdfbox.apache.org), which appears to be a very complete library for reading and writing PDF, including things like providing the X/Y location of constructs (a challenge because it requires essentially rendering the data just like Acrobat Reader would--PDF allows quite complex chains of calculations that all contribute to the final location and orientation of any given graphical element). I haven't had a chance to really push on it but I did read over the docs and the API and some of the sample apps they provide and it looks pretty promissing.
PDFBox doesn't do everything the commercial tools do--it doesn't do the sort of synthesis of higher-level constructs that the commercial tools do (paragraph recognition, dehyphenation, etc.), which is where the core value of the commercial tools are (these are hard problems) but it looks like it provides enough raw functionality to let you develop these features to one degree or another. The problems are challenging but they're also interesting puzzles too.
Anyway, Dr. Macro says check out PDFBox.
Update 2015: PDFBox is still going strong and continues to improve. I've been using it for various things all these years and still swear by it.
I'm particularly pleased to see PDFBox because several years ago, after a very painful experience with a particularly poorly-implemented Java PDF library (which shall remain nameless), I started implementing my own but only got as far as reading and writing pages, not page components. The project (PDF4J on Sourceforge) sat idle until recently, when I deactivated it in the face of the existence of PDFBox--I don't mind having PDF4J obsoleted; far from it I'm happy to see that someone did what I didn't have time or energy to do.
Labels: pdf "pdf data extraction" pdfbox