XML is a huge mess

I have a 39 MB XML file that I wanted to process. I wasn’t expecting it to be so difficult. Writing the code, in multiple languages, was not difficult. But running the programs was a big problem.

My first attempt was a simple Haskell program, but I had to kill it after it ate over 1.3 GB (yes, 1.3 GB) of ram!

Haskell’s strings are known to be memory hogs, and the HaXml module I was using was making them even worse by not sensible decoding the UTF-8 text correctly. I decided to write a leaner Haskell program later, and switch to Perl to get the job done.

At this point I also decided to set a limit to the amount of memory the programs could consume. For a 39 MB file I hoped that 10 times that would be enough, so I rounded up and set the limit at 512 MB.

But Perl, using the XML::LibXML module, couldn’t process the file with that memory limit. I also ran a quick one-liner in Erlang, just to watch it crash out of memory too. I’m going to try some other languages to see if I can find one that can work in 512 MB.

My next useful step is to try the XML::Twig module in Perl. I’ve had good experiences with it before. It won’t be as fast as LibXML, but it probably has the best chance of surviving within my 512 MB limit. For Haskell, I think I’ll have to resort to a SAX style parser.