Beating down the XML

XML is still a huge mess, but at least now I have managed to get a few programs that can handle it with reasonable-ish memory requirements.

For Perl, as I had thought, the XML::Twig module gave me a pleasant interface and was able to easily handle the document.

For Haskell it was a little bit trickier. I used the SAX parser in HaXml, but it is not like a regular SAX parser, since Haskell is so unlike any regular language. The parser returns a lazy list of SAX events, so I had to make sure I processed the list without evaluating the whole thing into memory.
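The original programs are not shown here, so what follows is only a rough sketch of the same consume-as-you-go idea, written in Python rather than Haskell or Perl. It assumes the goal is something simple like counting elements; the function name `count_elements` and the `entry` tag are made up for illustration. `xml.etree.ElementTree.iterparse` hands back parse events one at a time, much like walking a lazy list of SAX events, and the loop discards each element once it has been handled instead of keeping the whole document.

```python
import io
import xml.etree.ElementTree as ET


def count_elements(source, tag):
    """Count <tag> elements without keeping their contents in memory.

    `source` is a filename or a binary file-like object.
    `tag` is a hypothetical element name chosen for this sketch.
    """
    count = 0
    # iterparse yields (event, element) pairs lazily while reading the input,
    # so only the part of the document seen so far has been parsed.
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == tag:
            count += 1
        # Drop the element's text, attributes and children once it is done.
        # Emptied elements stay attached to their parent until the parent
        # itself ends, so memory use is small but not strictly constant.
        elem.clear()
    return count


sample = io.BytesIO(b"<entries><entry>a</entry><entry>b</entry><other/></entries>")
print(count_elements(sample, "entry"))
```

The point carries over to the lazy-list case: fold over the events as they arrive and never hold a reference to the full list or tree.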

Now that I’ve dealt with the memory issue it appears that I have a speed issue to deal with next.

7 thoughts on “Beating down the XML”

  1. mirod

    I am in the process of writing a howto about XML::Twig and encodings. Could you tell me which encoding the document is in? (Hoping for something exotic here ;-)

    Thanks

  2. Quim

    Free the code! :)

    erm.. some examples on how you did it would be appreciated, at least for some of us who do not have any experience with Haskell at all, but for whom being able to see both implementations side-by-side would be great!

  3. Marty Post author

    My XML document is encoded in Unicode UTF-8. Isn’t everyone’s? :-)

    It does contain Japanese text, though.

  4. mirod

    UTF-8, how banal! And no, not everyone’s data is in Unicode. And actually, text that could be encoded in plain Latin-1 is often more of a pain to deal with than Shift-JIS and the like, at least in Perl. I can always generate Shift-JIS through vim digraphs and then iconv, although it’s a write-only process as I can’t read Japanese; I was just looking for a bit of real data. It doesn’t matter that much though, I’ll be talking about encodings at the Italian Perl Workshop first.

  5. Pingback: Perl’s XML::Twig — バカな火星人

  6. Marty Post author

    Encodings cause so many fun problems because most don’t contain any self-identification. Do you have any good way to tell the difference between ISO-8859-1, ISO-8859-7, ISO-8859-*, or KOI8? What about EUC-JP and EUC-KR?

    I know people who went searching for bugs in ftp after a data transfer from Korea to Japan “was corrupted” :-)
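To illustrate the self-identification problem raised in the comment above, here is a small Python sketch (not from the thread; the byte values and the helper name `decode_all` are chosen purely for the example). The same three bytes decode without any error under ISO-8859-1, ISO-8859-7 and KOI8-R, and each encoding produces different text, so nothing in the data itself says which reading is the intended one.

```python
def decode_all(raw, encodings):
    """Decode the same bytes under several encodings and return the results."""
    return {enc: raw.decode(enc) for enc in encodings}


# Three arbitrary high-bit bytes; all are valid in each encoding below.
raw = bytes([0xC1, 0xC2, 0xC3])

for enc, text in decode_all(raw, ["iso-8859-1", "iso-8859-7", "koi8-r"]).items():
    # Each line shows a different, equally "valid" reading of the same bytes:
    # Latin letters with accents, Greek capitals, and Cyrillic lowercase.
    print(enc, text)
```

Telling such encodings apart takes outside information, such as a declared charset, or a statistical guess based on which reading looks like plausible text.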
