Help Me Obi-Wan: Java and Encodings and XML
But today I ran into a wall and all my fiddly bit colleagues are elsewhere so I thought I would ask my readers for help.
Here's what I'm trying to do:
I have XML documents with Arabic content. I read these documents into an internal data structure, do stuff, and write the result out as different XML. Should be easy.
However, I'm finding several odd things that I don't quite understand:
1. Text.getData() is *not* returning a sequence of Unicode characters, it is returning a sequence of characters that correspond one-to-one to the bytes of the UTF-8 encoding of the original Unicode characters.
That threw me because I thought XML data *was* Unicode and therefore Text.getData() should return Unicode characters, not a sequence of single-byte chars. Or have I totally misunderstood how Java manages Strings (I don't think so)?
This is solved by getting the bytes from the string returned by Text.getData() and reinterpreting them using an InputStreamReader with the encoding set to "utf-8". (Is there a better way? Have I again missed something obvious?)
2. When I save the same document as UTF-16 the DOM construction process fails with "Content not allowed in prolog", which doesn't compute because it's not conceivable that any non-trivial XML parser wouldn't handle UTF-16 correctly.
3. When re-interpreting the UTF-8 bytes into characters, it mostly works, except that at least one character, \uFE8D (Arabic Letter Alef Isolated Form), whose UTF-8 byte sequence is EF BA 8D, is reported as EF BA EF, which is not a Unicode character and is converted to \uFFFD and "?" by the input stream reader.
I suspect that I am in fact using a crappy parser but there is so much indirection and layers and IDEs and stuff that it's very difficult, at least for me, to determine which parser I'm using, much less how to control the parser I want to use. I'm developing my code using Eclipse 3.2. I've tried setting my project to both Java 1.4 and 5.0 with no change in behavior.
For this project I have the Xerces 2.9.0 library (as reported by org.apache.xerces.impl.Version) in my classpath.
Does anyone have any idea what might be going on here?
Any help or pointers on what I might be doing wrong or how to fix it?