Monday, September 25, 2006

XML vs Streams

Ever get the feeling that XML sometimes makes things harder than they need to be? One of the problems that I and others have run into is that when parsing XML using SAX (or, in my case, using JAXB to unmarshall XML, which uses SAX under the hood) from an InputStream, the stream is always closed after the parsing is complete. This is surprising behaviour in Java, because you're not supposed to close streams you don't own. This can be a problem when using network sockets or when you want to write two or more unrelated XML documents to the same stream.

The root cause of this isn't anything in Java or SAX, it's the XML specification itself. There's simply no way to tell when an XML document ends. Every well-formed XML document will have an end tag, so you'd think the parser could simply stop when it reached the end tag. However, the XML specification allows for an arbitrary amount of comments and processing instructions to follow the main element. So when reading from a stream, the parser has no choice but to keep reading until it gets to the end of the stream. After reading to the end of the stream, SAX calls close on it. I believe the reason for this is that the stream has been completely read anyway, so there's no point keeping it open.

I've had a couple of cases where I've wanted to write two or more independent XML documents to the same stream. In one case, I had a thread producing XML documents and writing them to a PipedOutputStream. Another thread would read from the corresponding PipedInputStream. The second case was a save file for an application that happened to include two unrelated XML documents. Because of the "uncertain ending" feature of the XML spec, neither of these things are possible.

Not unless you use customised input and output streams, that is. After some searching, I found some classes from Xerces called WrappedOutputStream and WrappedInputStream that allowed me to do what I wanted. They actually solve the more general problem of reading data of unknown length without reading too much. I ended up writing my own implementations, but I used the same "protocol" as these classes. I can now read and write XML over streams without having to use a new stream for each document.