XML is a huge mess

I have a 39 MB XML file that I wanted to process. I wasn’t expecting it to be so difficult. Writing the code, in multiple languages, was not difficult. But running the programs was a big problem.

My first attempt was a simple Haskell program, but I had to kill it after it ate over 1.3 GB (yes, 1.3 GB) of RAM!

Haskell’s strings are known to be memory hogs, and the HaXml module I was using made them even worse by not decoding the UTF-8 text correctly. I decided to write a leaner Haskell program later and switch to Perl to get the job done.

At this point I also decided to set a limit on the amount of memory the programs could consume. For a 39 MB file I hoped ten times that would be enough, so I rounded up and set the limit at 512 MB.

But Perl, using the XML::LibXML module, couldn’t process the file within that memory limit. I also ran a quick one-liner in Erlang, just to watch it crash out of memory too. I’m going to try some other languages to see whether I can find one that can do the job in 512 MB.

My next step is to try the XML::Twig module in Perl. I’ve had good experiences with it before. It won’t be as fast as LibXML, but it probably has the best chance of staying within my 512 MB limit. For Haskell, I think I’ll have to resort to a SAX-style parser.
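For readers unfamiliar with the SAX style: instead of building a full DOM in memory, the parser fires a callback per event, so memory use stays roughly constant no matter how big the file is. A minimal sketch using Python’s standard library (Python only for illustration here; my actual code is in Haskell and Perl, and the tag names below are made up):

```python
import xml.sax

class CountingHandler(xml.sax.ContentHandler):
    """Counts elements and text without ever building a tree."""
    def __init__(self):
        super().__init__()
        self.elements = 0
        self.chars = 0

    def startElement(self, name, attrs):
        self.elements += 1          # one callback per opening tag

    def characters(self, content):
        self.chars += len(content)  # text arrives in chunks, not whole nodes

handler = CountingHandler()
xml.sax.parseString(b"<root><a>hi</a><a>bye</a></root>", handler)
print(handler.elements, handler.chars)  # -> 3 5
```

The price is that you must carry your own state between callbacks, but the parser never holds more than the current event in memory.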

6 comments

  1. I’ve always hated XML. I guess it keeps everything really structured, but honestly it goes over the top: all the talk of standardization invites a lot of over-engineered solutions.

    Though to help: I’ve had a very good experience with libxml2 if you’re in C. I’ve tried it in C++ as well, and I’ve only used the Python bindings for about 10 minutes, but the C API is very nice :). You have to do your own memory management as you go, but it works very well.

  2. XML::SAX::Machines and possibly some of the modules from XML::Toolkit would probably fit too. Depends on what you want to do. LibXML has a SAX parser API too.

  3. I think you should have gone with the SAX approach in the first place with such a large document. Then you can build your own simple data structure, which is likely much more memory-efficient (if less featureful) than a full DOM.

    Now, you talk about processing, but not exactly what you are doing. It is very rare that people need the whole DOM. If this is merely transforming the data from XML to something else, then take a look at XSLT. If the data is structured as a root node with many top-level children, then I would approach it like so many XMPP developers do and treat the document as a stream.

    If you go that route, you can make use of POE::Filter::XML (even outside of POE), and it will push-parse the document (provided you feed it the way POE would), returning top-level document fragments to which you can apply XPath expressions, since the underlying nodes that PFX spits out are XML::LibXML::Element based.
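    The stream-of-fragments idea above can be sketched with Python’s stdlib `iterparse` (Python standing in for the Perl modules named, with a hypothetical `<record>` element as the top-level child): process each fragment as its end tag arrives, then drop it so memory stays flat.

    ```python
    import io
    import xml.etree.ElementTree as ET

    doc = b"<log><record id='1'>foo</record><record id='2'>bar</record></log>"

    seen = []
    # iterparse yields (event, element) pairs as the parser advances
    for event, elem in ET.iterparse(io.BytesIO(doc), events=("end",)):
        if elem.tag == "record":
            seen.append((elem.get("id"), elem.text))
            elem.clear()  # free the subtree we just handled
    print(seen)  # -> [('1', 'foo'), ('2', 'bar')]
    ```

    Each fragment is a real element while you hold it, so per-fragment XPath-style queries still work; only the already-processed part of the document is discarded.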

Comments are closed.