Beating down the XML

XML is still a huge mess, but at least I now have a few programs working that can handle it with reasonable-ish memory requirements.

For Perl, as I had thought, the XML::Twig module gave me a pleasant interface and was able to easily handle the document.

For Haskell it was a little bit trickier. I used the SAX parser in HaXml, but it is not like a regular SAX parser, since Haskell is so unlike any regular language. The parser returns a lazy list of SAX events, so I had to make sure I processed the list without evaluating the whole thing into memory.
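To give a flavour of what "processing the list without evaluating the whole thing" means, here is a minimal sketch. It is not the program described above: the `Event` type and the `events` generator are made-up stand-ins for the parser's lazy list of SAX events, so that the example compiles without HaXml installed. The idea it illustrates is to consume the list in a single pass with a strict left fold, so that each event can be garbage collected once it has been looked at.

```haskell
module Main where

import Data.List (foldl')

-- Hypothetical stand-in for a SAX event type (an assumption for this sketch;
-- a real SAX event type has more constructors than this).
data Event
  = Open String   -- opening tag with its element name
  | Close String  -- closing tag with its element name
  | Text String   -- character data
  deriving (Show, Eq)

-- Stand-in for the parser: lazily produces the events for n <entry> elements.
events :: Int -> [Event]
events n = concatMap entry [1 .. n]
  where
    entry i = [Open "entry", Text (show i), Close "entry"]

-- Count the opening tags with the given name in one pass.
-- foldl' forces the accumulator at every step, so no chain of
-- unevaluated (+ 1) thunks builds up while the list is consumed.
countOpen :: String -> [Event] -> Int
countOpen name = foldl' step 0
  where
    step acc (Open n) | n == name = acc + 1
    step acc _                    = acc

main :: IO ()
main = print (countOpen "entry" (events 100000))
```

The thing to avoid is keeping a second reference to the event list, for example by calling `length` on it and then traversing it again. That forces the runtime to retain every event until the second traversal is done, which defeats the point of the lazy list.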

Now that I’ve dealt with the memory issue, it appears that I have a speed issue to deal with next.

7 comments

mirod says:

2009-10-07 at 16:07

I am in the process of writing a howto about XML::Twig and encodings. Could you tell me which encoding the document is in? (Hoping for something exotic here ;-)

Thanks
Quim says:

2009-10-07 at 17:24

Free the code! :)

erm.. some examples on how you did it would be appreciated, at least for some of us who do not have any experience with Haskell at all, but for whom being able to see both implementations side-by-side would be great!
Marty says:

2009-10-08 at 00:58

My XML document is encoded in Unicode UTF-8. Isn’t everyone’s? :-)

It does contain Japanese text, though.
mirod says:

2009-10-08 at 05:40

UTF-8, how banal! And no, not everyone’s data is in Unicode. And actually, text that could be encoded in plain Latin-1 is often more of a pain to deal with than Shift-JIS and the like, at least in Perl. I can always generate Shift-JIS through vim digraphs, then iconv, although it’s a write-only process as I can’t read Japanese. I was just looking for a bit of real data. It doesn’t matter that much though, I’ll be talking about encodings at the Italian Perl Workshop first.
Pingback: Perl’s XML::Twig — バカな火星人
Marty says:

2009-10-14 at 01:37

Encodings cause so many fun problems because most don’t contain any self-identification. Do you have any good way to tell the difference between ISO-8859-1, ISO-8859-7, ISO-8859-*, or KOI8? What about EUC-JP and EUC-KR?

I know people who went searching for bugs in ftp after a data transfer from Korea to Japan “was corrupted” :-)
Liyang HU says:

2010-03-07 at 06:17

My solution is to deny the existence of anything other than UTF-8.
