Dr. Macro's XML Rants

NOTE TO TOOL OWNERS: In this blog I will occasionally make statements about products that you will take exception to. My intent is to always be factual and accurate. If I have made a statement that you consider to be incorrect or inaccurate, please bring it to my attention and, once I have verified my error, I will post the appropriate correction.

And before you get too exercised, please read the post dated 9 Feb 2006, titled "All Tools Suck".

Tuesday, April 10, 2007

Help Me Obi-Wan: Java and Encodings and XML

While I consider myself a pretty good Java programmer, I don't actually do that much processing of XML with Java, so I've never fully internalized the details of SAX and JAXP and all that. Pretty much I just crib code that will get me a DOM and hope it works, or get someone else to implement the fiddly bits.

But today I ran into a wall and all my fiddly bit colleagues are elsewhere so I thought I would ask my readers for help.

Here's what I'm trying to do:

I have XML documents with Arabic content. I read these documents into an internal data structure, do stuff, and write the result out as different XML. Should be easy.

However, I'm finding several odd things that I don't quite understand:

1. Text.getData() is *not* returning a sequence of Unicode characters; it is returning a sequence of characters that correspond one-to-one to the bytes of the UTF-8 encoding of the original Unicode characters.

That threw me because I thought XML data *was* Unicode and therefore Text.getData() should return Unicode characters, not a sequence of single-byte chars. Or have I totally misunderstood how Java manages Strings (I don't think so)?

This is solved by getting the bytes from the string returned by Text.getData() and reinterpreting them using an InputStreamReader with the encoding set to "utf-8". (Is there a better way? Have I again missed something obvious?)
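
Here's roughly what that workaround looks like (just a sketch; it assumes the bad string really does hold one char per original byte, so ISO-8859-1 round-trips the bytes unchanged, and the method name is mine):

import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;

// Re-decode a string whose chars are really the bytes of a UTF-8 sequence.
static String reDecodeAsUtf8(String raw) throws IOException {
    byte[] bytes = raw.getBytes("ISO-8859-1");   // one byte per char, values preserved
    Reader reader = new InputStreamReader(new ByteArrayInputStream(bytes), "utf-8");
    StringBuffer result = new StringBuffer();
    for (int c = reader.read(); c != -1; c = reader.read()) {
        result.append((char) c);
    }
    return result.toString();
}

// Usage: String fixed = reDecodeAsUtf8(textNode.getData());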

2. When I save the same document as UTF-16 the DOM construction process fails with "Content not allowed in prolog", which doesn't compute because it's not conceivable that any non-trivial XML parser wouldn't handle UTF-16 correctly.

3. When re-interpreting the UTF-8 bytes into characters, it mostly works, except that at least one character, \uFE8D (Arabic Letter Alef Isolated Form), whose UTF-8 byte sequence is EF BA 8D, is reported as EF BA EF, which is not a Unicode character and is converted to \uFFFD and "?" by the input stream reader.

WTF?

I suspect that I am in fact using a crappy parser but there are so many layers of indirection and IDEs and stuff that it's very difficult, at least for me, to determine which parser I'm using, much less how to control the parser I want to use. I'm developing my code using Eclipse 3.2. I've tried setting my project to both Java 1.4 and 5.0 with no change in behavior.

For this project I have the Xerces 2.9.0 library (as reported by org.apache.xerces.impl.Version) in my classpath.
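
For anyone who wants to do the same check, this is roughly all it takes (a sketch; getVersion() is my assumption about what your Xerces build exposes, and the jar name is just an example):

// Report which Xerces is actually on the classpath.
System.out.println(org.apache.xerces.impl.Version.getVersion());
// Or from a shell: java -cp xercesImpl.jar org.apache.xerces.impl.Version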

Does anyone have any idea what might be going on here?

Any help or pointers on what I might be doing wrong or how to fix it?

8 Comments:

Blogger qu1j0t3 said...

Tried JDOM? It will let you switch parsers.

6:27 PM  
Blogger Unknown said...

What's the encoding of the original document? And are you sure? Java should be converting everything into Unicode as it reads the document in (somewhere down beneath the layers), but it will make certain assumptions about the format of the source. If it's XML without an encoding declared, it'll assume UTF-8, but if that's not entirely correct, it could be making the wrong guesses... just a thought.

7:58 PM  
Blogger Carey Evans said...

It sounds a lot like you’re opening a FileReader on the document using the default encoding, Windows codepage 1252, and passing that to the XML parser. This would account for the not-quite byte-by-byte interpretation of the input, since 0x8D is one of the few characters undefined in Cp1252.

Sun’s documented method of parsing XML into a DOM is to get hold of a javax.xml.parsers.DocumentBuilderFactory by its newInstance() method, use that factory’s newDocumentBuilder() method to get a javax.xml.parsers.DocumentBuilder, then call the parse(...) method that suits you best. The parser should eventually open an InputStream on the filename or URI that you pass and detect the encoding itself.
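
In code, that boils down to something like this (a sketch only; the class name and the use of a file argument are just examples):

import java.io.File;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

public class ParseCheck {
    public static void main(String[] args) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        DocumentBuilder builder = factory.newDocumentBuilder();
        System.out.println(builder.getClass());   // shows which implementation you actually got
        // Hand the parser the file itself so it can detect the encoding from the bytes/BOM.
        Document doc = builder.parse(new File(args[0]));
        System.out.println(doc.getDocumentElement().getNodeName());
    }
}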

To find out what you’re actually using, you should be able to pass the parser object to System.out.println(), or call the getClass() method and print that.

Have a look at http://java.sun.com/j2ee/1.4/docs/tutorial/doc/JAXPDOM3.html for Sun’s documentation and http://java.sun.com/developer/codesamples/xml.html for their examples. And as Toby said, you may have more luck with JDOM, dom4j or XOM than the default W3C DOM.

12:01 AM  
Anonymous Anonymous said...

For this kind of problem you need to be systematic.

First, is there anywhere in the code where you read or write a file either without using an auto-detect reader or without an explicit encoding that matches the document?

Such systems are dangerous because they may work in one locale with some data for years, then suddenly break when moved to a different locale or different data. (Add such a review to your standard QA or audit regime.)

A variant on this is that you need to be aware that some Java APIs deal in arrays of bytes and others in arrays of (Unicode) characters: converting bytes to characters is sometimes deferred until quite late in a processing chain; look out for code that takes a byte array and treats it as a character array (by casting/coercion etc.). The more your system deals in byte arrays, the more chance that somewhere along the chain there will be a slip-up with encoding.
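
The classic slip-up looks something like this (a sketch; readBytesSomehow() is just a stand-in for wherever the raw bytes come from):

byte[] bytes = readBytesSomehow();
// Widening each byte to a char "works" for ASCII but turns every byte of a
// multi-byte UTF-8 sequence into its own bogus character.
StringBuffer sb = new StringBuffer();
for (int i = 0; i < bytes.length; i++) {
    sb.append((char) (bytes[i] & 0xFF));
}
// Correct: decode explicitly, e.g. new String(bytes, "UTF-8")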

Second, what is the encoding of the input document...not what the xml encoding declaration says, but what do the bytes actually say? Many encodings are signature-compatible (the ISO 8859-n series, for example) and they won't flag any errors coming in.

For this you first check these things (a quick hex dump, as sketched after this list, works as well as a hex editor):
1) Is it UTF-16? Use a hex editor. Are A-Za-z characters represented as alternating nulls and ASCII characters?
2) Is it UTF-8? Use a hex editor. Do A-Za-z characters take 1 byte and Arabic characters two bytes (three for the presentation forms such as U+FE8D)?
3) Is it ISO 8859-6 (Arabic)? Use a hex editor. All characters take one byte, A-Za-z in ASCII bytes.
4) Is the document actually in mixed encoding? This can happen when information is merged inattentively from different sources. The data may look like UTF-16 or UTF-8 but actually be something else. This is a last resort. Check the code values for the Arabic characters in the original XML document, then check what you expect them to be in the encoding named in the XML header. Open the document in an Arabic-displaying editor and check whether it displays as expected, noting which encoding you had to use to open it (don't use IE or a browser for this; it converts to HTML and opens up another can of worms).
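
A quick-and-dirty way to do that check without a hex editor (a sketch; it just prints the first 64 bytes of whatever file you name):

import java.io.FileInputStream;

public class HexHead {
    public static void main(String[] args) throws Exception {
        FileInputStream in = new FileInputStream(args[0]);
        byte[] buf = new byte[64];
        int n = in.read(buf);
        in.close();
        for (int i = 0; i < n; i++) {
            String hex = Integer.toHexString(buf[i] & 0xFF).toUpperCase();
            System.out.print((hex.length() == 1 ? "0" + hex : hex) + " ");
        }
        System.out.println();
        // FF FE or FE FF up front means a UTF-16 BOM; EF BB BF means a UTF-8 BOM.
    }
}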

Third, what does the XML encoding declaration say? If it does not correspond with what you found, or it names the wrong ISO 8859-n, then the transcoding will be wrong.

Fourth, at this stage you must just trace the document through the stages to see where the problem occurs. So when the document is first read in, check that the Arabic characters are the characters you expect.

Rick Jelliffe

1:22 AM  
Blogger Chris said...

Make sure you pass the XML parser an InputStream instead of a Reader, so that it can do the necessary encoding auto-detection.
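
That is, roughly (a sketch, reusing a DocumentBuilder like the one in the earlier example; "doc.xml" is just an example name):

// A Reader has already decoded the bytes with some charset (often the platform default):
Document bad = builder.parse(new org.xml.sax.InputSource(new java.io.FileReader("doc.xml")));
// An InputStream hands the parser raw bytes, so it can sniff the BOM and encoding declaration itself:
Document good = builder.parse(new java.io.FileInputStream("doc.xml"));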

7:51 AM  
Blogger Eliot Kimber said...

My thanks to everyone for their suggestions. My data was in the encodings I thought it was in, had the right BOMs (or not), etc.

I'm not sure exactly what I did that fixed the problem but it did go away.

I reworked my DOM creation to this:

InputStream xmlStream = new FileInputStream(xmlFile);
// Request the Xerces SAX parser factory via the JAXP lookup property.
System.setProperty("javax.xml.parsers.SAXParserFactory",
        "org.apache.xerces.jaxp.SAXParserFactoryImpl");
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
factory.setValidating(false);
factory.setNamespaceAware(true);
DocumentBuilder dp = factory.newDocumentBuilder();
Document doc = dp.parse(xmlStream);

I also reworked my handling of the result returned by Text.getData() to just treat it as a string (I had been doing what I now realize was dubious byte-level processing of the string).

Now I get the right characters and I can process UTF-8 and UTF-16 versions of the input document.

So now I can get on with the actual data processing job (I'm converting text chunks extracted from PDF created from OCR scanning into an Excel spreadsheet such that the spreadsheet cell arrangements reflect the original relative horizontal alignments of the text in the input lines).

7:57 AM  
Anonymous Anonymous said...

Thanks for this post - I was having a similar problem, and your post helped me track it down.

I didn't end up needing to change my DocumentBuilderFactory/DocumentBuilder/parse code - it was all working with UTF-8 data correctly, and the text was being properly written to MySQL. My problem was that the command-line mysql client I was using to check the data was not configured with the correct encoding.

Anyway, thanks for the help.

7:02 AM  
Blogger Unknown said...

Glad you solved your problem. It sounds like part of the problem was a character encoding/translation issue. For what it's worth, Java String objects internally use UTF-16. But when Strings are created from raw bytes without an explicit charset, the String class decodes them with the platform's default encoding, which is exactly where this kind of surprise creeps in.
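
For instance (a sketch, reusing the byte sequence from the post):

public class DecodeDemo {
    public static void main(String[] args) throws Exception {
        byte[] alef = { (byte) 0xEF, (byte) 0xBA, (byte) 0x8D };   // UTF-8 bytes for U+FE8D
        System.out.println(new String(alef, "UTF-8"));   // the Alef isolated form
        System.out.println(new String(alef));            // whatever the platform default makes of it
    }
}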

4:03 AM  
