Dr. Macro's XML Rants: PDF Processing Fun

The company I work for does data conversion as a large part of its business. Some of this involves extracting content from PDF documents to turn it into XML in whatever schema the customer specifies.

This is a difficult problem in the general case, and an impossible one in some pathological cases. This is because PDF, like PostScript on which it is based, is a graphics command language optimized for drawing glyphs (graphical representations of characters) in a sequence of two-dimensional spaces called pages.

For example, there's no requirement that any two characters have any particular relationship to each other in the PDF data stream--their visual relationship is entirely a function of their (possibly coincidental) placement on the page next to each other. Two glyphs that are adjacent on the rendered page need not be adjacent in the PDF data stream.
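
To make that concrete, here is a minimal sketch in Python of recovering reading order from positions alone. The Glyph record, its coordinates, and the line_tolerance value are assumptions made up for illustration; they are not what any particular PDF library returns, and a real document would also need handling for fonts, rotation, and multiple columns.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    char: str
    x: float  # left edge, in page units (hypothetical field)
    y: float  # baseline, in page units; PDF y grows upward


def reading_order(glyphs, line_tolerance=2.0):
    """Group glyphs into lines by baseline, then sort each line left to right."""
    lines = []  # each entry is [baseline_y, [glyphs on that line]]
    for g in sorted(glyphs, key=lambda g: -g.y):  # top of the page first
        if lines and abs(lines[-1][0] - g.y) <= line_tolerance:
            lines[-1][1].append(g)
        else:
            lines.append([g.y, [g]])
    return [
        "".join(g.char for g in sorted(members, key=lambda g: g.x))
        for _, members in lines
    ]


# Glyphs deliberately listed out of visual order, as a PDF stream is allowed to do.
stream = [Glyph("i", 16, 700), Glyph("o", 10, 686), Glyph("H", 10, 700), Glyph("k", 16, 686)]
print(reading_order(stream))  # ['Hi', 'ok']
```

The point of the sketch is only that the order of the data stream is ignored entirely; everything is derived from where each glyph lands on the page.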

For the most part PDF documents are not quite this pathological but they still aren't necessarily easy to process. For example, in nicely typeset documents a sequence of characters can be expressed in the PDF as a sequence of characters and positioning values, representing the kerning between characters, something like this:

[(a)4(b)-0.35(c)-0.64(def)] TJ

Rendered, this will be "abc def", where the characters sit closer together or farther apart, with a lot of space between the "c" and "d" (but no literal space character).
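
As a rough illustration, the body of an array like the one above can be split into its strings and its numeric adjustments with a few lines of Python. This is a sketch under simplifying assumptions: it handles only plain parenthesized strings with no escapes, nested parentheses, or hex strings, and the function name parse_text_array is made up for this example.

```python
import re

# A parenthesized string without escapes or nesting, or a signed number.
TOKEN = re.compile(r"\(([^()\\]*)\)|([-+]?(?:\d+\.?\d*|\.\d+))")


def parse_text_array(source):
    """Split a bracketed text array into strings and numeric adjustments."""
    body = source.strip()
    if not (body.startswith("[") and body.endswith("]")):
        raise ValueError("expected a bracketed array")
    items = []
    for text, number in TOKEN.findall(body[1:-1]):
        items.append(float(number) if number else text)
    return items


items = parse_text_array("[(a)4(b)-0.35(c)-0.64(def)]")
print(items)  # ['a', 4.0, 'b', -0.35, 'c', -0.64, 'def']
print("".join(i for i in items if isinstance(i, str)))  # abcdef
```

Note that concatenating the strings gives "abcdef": any visible gap lives entirely in the numbers. In the PDF specification those numbers are thousandths of a unit of text space and are subtracted from the current position, so a negative value widens the gap and a positive one tightens it.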

An obvious challenge is determining whether the space between the "c" and "d" should be captured as a literal space--that turns out to be a big problem.
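
One naive way to attack this is a fixed threshold on the adjustment values, sketched below in Python. The threshold of -200 is purely an assumption for illustration (a real tool would more likely compare the gap against the width of a space in the current font), and the -640 in the first call is a made-up value chosen to represent a visibly wide gap.

```python
def join_with_spaces(items, gap_threshold=-200.0):
    """Concatenate strings, inserting a space where an adjustment opens a wide gap.

    items mixes strings and numeric adjustments; negative numbers widen the gap.
    """
    out = []
    for item in items:
        if isinstance(item, str):
            out.append(item)
        elif item <= gap_threshold and out and not out[-1].endswith(" "):
            out.append(" ")
    return "".join(out)


print(join_with_spaces(["a", 4, "b", -0.35, "c", -640, "def"]))   # abc def
print(join_with_spaces(["a", 4, "b", -0.35, "c", -0.64, "def"]))  # abcdef
```

A single threshold like this breaks down quickly on justified text, tight tracking, or unusual fonts, which is part of why the problem is a big one.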

Another basic problem is paragraph detection. In the PDF, each line of text is specified as one or more sequences like the one shown above. There's nothing in the PDF that explicitly tells you that the lines form a larger logical construct. But of course for creating marked-up documents you really need to know where the paragraphs are.
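
A simple heuristic, sketched here in Python, is to treat an unusually large vertical gap between baselines as a paragraph break. The input format (a list of baseline and text pairs, top to bottom, single column) and the gap_factor of 1.5 are assumptions for the sake of the example; indentation, font changes, and columns are ignored.

```python
from statistics import median


def group_paragraphs(lines, gap_factor=1.5):
    """Group (baseline_y, text) lines into paragraphs.

    lines must be ordered top to bottom, so y decreases down the page.
    """
    if not lines:
        return []
    gaps = [upper[0] - lower[0] for upper, lower in zip(lines, lines[1:])]
    leading = median(gaps) if gaps else 0.0  # typical line-to-line distance
    paragraphs = [[lines[0][1]]]
    for gap, (_, text) in zip(gaps, lines[1:]):
        if gap > gap_factor * leading:
            paragraphs.append([text])  # big gap: start a new paragraph
        else:
            paragraphs[-1].append(text)
    return [" ".join(p) for p in paragraphs]


page = [
    (700, "This is the first"),
    (688, "paragraph of text."),
    (664, "And this is the"),
    (652, "second one."),
]
print(group_paragraphs(page))
```

This only works when paragraphs are separated by extra vertical space; documents that mark paragraphs with a first-line indent alone would need a different signal.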

This turns out to be a really hard problem. And then there's stuff like table recognition, dehyphenating words at the ends of lines, handling sentences and paragraphs that break across pages, isolating things like headers and footers, and so on, all of which makes the problem challenging.
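
Dehyphenation is the easiest of these to sketch. The Python below merges a word split by an end-of-line hyphen when the next line starts with a lowercase letter. That rule is an assumption for illustration, and it will wrongly join genuine compounds such as "well-" followed by "known"; doing better needs a dictionary or similar evidence.

```python
def dehyphenate(lines):
    """Join lines into one string, merging words split by an end-of-line hyphen."""
    out = ""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        split_word = (
            out.endswith("-")
            and len(out) >= 2
            and out[-2].isalpha()
            and line[0].islower()
        )
        if split_word:
            out = out[:-1] + line  # drop the hyphen and glue the halves
        elif out:
            out += " " + line
        else:
            out = line
    return out


print(dehyphenate(["PDF extraction is a chal-", "lenging problem that re-", "quires care."]))
# PDF extraction is a challenging problem that requires care.
```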

We've been experimenting with some commercial tools, which I can't really mention, that do a lot of this--there are obviously some clever folks at these companies. But they tend to charge a pretty hefty price for their software--I think it's commensurate with the value they provide but it still places it out of the reach of a lot of casual users.

For my own work, I needed a quick way to get at the raw PDF data within a document, partly so I could see why the tools we are using were not always giving the answer we wanted, and whether the cause was a bug or just pathological PDF data.

So I poked around for an open-source, Java-based PDF library and found PDFBox (http://pdfbox.apache.org), which appears to be a very complete library for reading and writing PDF, including things like providing the X/Y location of constructs (a challenge because it requires essentially rendering the data just like Acrobat Reader would--PDF allows quite complex chains of calculations that all contribute to the final location and orientation of any given graphical element). I haven't had a chance to really push on it but I did read over the docs and the API and some of the sample apps they provide and it looks pretty promising.

PDFBox doesn't do everything the commercial tools do--it doesn't do the sort of synthesis of higher-level constructs (paragraph recognition, dehyphenation, etc.) that is the core value of the commercial tools (these are hard problems)--but it looks like it provides enough raw functionality to let you develop these features to one degree or another. The problems are challenging but they're also interesting puzzles.

Anyway, Dr. Macro says check out PDFBox.

Update 2015: PDFBox is still going strong and continues to improve. I've been using it for various things all these years and still swear by it.

I'm particularly pleased to see PDFBox because several years ago, after a very painful experience with a particularly poorly implemented Java PDF library (which shall remain nameless), I started implementing my own but only got as far as reading and writing pages, not page components. The project (PDF4J on Sourceforge) sat idle until recently, when I deactivated it in the face of the existence of PDFBox--I don't mind having PDF4J obsoleted; far from it, I'm happy to see that someone did what I didn't have the time or energy to do.

Labels: pdf "pdf data extraction" pdfbox

7 Comments:

Anonymous said...: It might be the nameless, poorly implemented library you're talking about (but then again, I have had a good experience with it, so you might refer to something else), but you should check out iText: http://www.lowagie.com/iText/

It seems to have more advanced features than PDFBox, but it's been a while since I compared them. I went with iText back then.; 7:56 AM
Eliot Kimber said...: No, iText is not the nameless library.

I took a look at iText and it appears to be focused very much on creation and much less on access. As a test I tried to write a small class to simply get to the individual text commands on a page and couldn't make it work. I did not find any relevant help online despite searching the iText message archives and whatnot, besides a message that showed exactly what I was trying to do and asserted that it should work. Hmph.

It's quite possible I just missed something obvious but I don't think so.

So for my application PDFBox seems to be a better fit, although iText might be better for people creating PDF from scratch.; 2:51 PM
Anonymous said...: You are right. PDFBox is for parsing PDF, iText for creating and manipulating PDF.
br,
Bruno (the iText guy); 4:11 AM
Anonymous said...: I think the reason why pdftransformer is so good at converting PDFs is that they use recognition technologies; 7:40 AM
Larry Kollar said...: I've had some luck using Ghostscript's "ps2ascii" utility (it can deal with PDF too). If you open up the PostScript driver file (ps2ascii.ps), the file begins by describing a collection of parameters you can feed to it. Turns out that the command-line "ps2ascii" program passes -dSIMPLE; if you remove that option, you get a stream of lines starting with F (font change), P (page break), or S (string at location). Passing -dCOMPLEX provides additional types: C (color), I (image at location), and R (rectangle at location).

For non-pathological PDFs, it's not at all difficult to combine the strings to form lines, and lines to form paragraphs. I did that much with a few awk scripts. With pathological PDFs, I presume that a skilled programmer could sort first by Y and then by X to assemble strings in their proper order (after doing something to keep font associations straight).

I'm basically a jumped-up tech writer; I got interested in this after several people came by with a request to "get the Word file out of this PDF." It's fun stuff!; 8:36 AM
Anonymous said...: I don't understand why you can't mention the product names. Considering what the blog is about it would be nice to know which product is good/bad/expensive etc...

comments can be anonymous as it doesn't make any difference who I am ;); 10:27 AM
Tilman Hausherr said...: Please correct the link, PDFBox is now at https://pdfbox.apache.org/ (and is still being improved); 10:02 AM


Wednesday, October 04, 2006
