Perl’s XML::Twig

I was asked to “Free the code” from my XML parsing experiment, so I will post some here. It may be a bit disappointing though, since these are only some short scripts, and they’re a bit ugly. I’ll explain the Perl one today, and do the Haskell sometime soon.

I was playing with Jim Breen’s Japanese dictionary and I wanted to make a list of the first kanji component in each entry. I wanted one result for each entry, so I used “(none)” if the entry had no kanji part. This is not a difficult problem, although XML makes it as slow and memory-intensive as many difficult problems.

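use strict;
use warnings;
use XML::Twig;

my @keb = (); # for the results

# Handler called once for each complete <entry> element.
sub entry {
    my ($t, $e) = @_;   # $t is the twig, $e is the entry element
    my $kt = "(none)";  # default when the entry has no kanji part
    if (my $k = $e->first_child("k_ele")) {
        if (my $keb = $k->first_child("keb")) {
            $kt = $keb->text();
        }
    }
    $e->purge;          # finished with this entry, so free its memory
    push @keb, $kt;
}

my $twig = XML::Twig->new(
    twig_handlers => { entry => \&entry }
);
$twig->parsefile($ARGV[0]);
$twig->purge;

# now the results are in @keb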

Using XML::Twig is quite simple. When I create the parser I tell it how to handle the elements I care about, and in this case I only care about “entry” elements. Each time the parser has read a complete entry, it calls my entry subroutine, passing the twig as the first parameter, $t, and the entry’s object as the second, $e. Inside the entry routine I can use DOM-style methods on $e to extract the data I want. Notice that I call $e->purge when I’ve got the data out. This tells the parser that I won’t need that element again, so it can free the memory. Because the whole document is never held in memory at once, XML::Twig manages to parse a file that most other modules can’t.
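To try the handler without the full dictionary file, here is a minimal sketch that parses a tiny in-memory document with the same element names. The sample XML, including the JMdict root element and the placeholder values K1, K2, R1 and R2, is invented for illustration and is not real dictionary data. This version purges through the twig object $t rather than through the element.

```perl
use strict;
use warnings;
use XML::Twig;

# Invented sample in the shape the handler expects (not real dictionary data).
my $xml = <<'XML';
<JMdict>
  <entry><k_ele><keb>K1</keb></k_ele><k_ele><keb>K2</keb></k_ele><r_ele><reb>R1</reb></r_ele></entry>
  <entry><r_ele><reb>R2</reb></r_ele></entry>
</JMdict>
XML

my @keb;   # one result per entry

my $twig = XML::Twig->new(
    twig_handlers => {
        # Called each time a complete <entry> has been parsed.
        entry => sub {
            my ($t, $e) = @_;
            my $k   = $e->first_child("k_ele");
            my $keb = $k && $k->first_child("keb");
            push @keb, $keb ? $keb->text : "(none)";
            $t->purge;   # discard everything parsed so far
        },
    },
);
$twig->parse($xml);   # parse a string instead of a file

print join(", ", @keb), "\n";   # prints "K1, (none)"
```

The first entry has two kanji elements but only the first is kept, and the second entry has none, so it falls back to “(none)”.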

Beating down the XML

XML is still a huge mess, but at least now I have a few programs that can handle it with reasonable-ish memory requirements.

For Perl, as I had thought, the XML::Twig module gave me a pleasant interface and was able to easily handle the document.

For Haskell it was a little bit trickier. I used the SAX parser in HaXml, but it is not like a regular SAX parser, since Haskell is so unlike any regular language. The parser returns a lazy list of SAX events, so I had to make sure I processed the list without evaluating the whole thing into memory.

Now that I’ve dealt with the memory issue it appears that I have a speed issue to deal with next.

XML is a huge mess

I have a 39 MB XML file that I wanted to process. I wasn’t expecting it to be so difficult. Writing the code, in multiple languages, was not difficult. But running the programs was a big problem.

My first attempt was a simple Haskell program, but I had to kill it after it ate over 1.3 GB (yes, 1.3 GB) of RAM!

Haskell’s strings are known to be memory hogs, and the HaXml module I was using was making them even worse by not decoding the UTF-8 text sensibly. I decided to write a leaner Haskell program later, and switch to Perl to get the job done.

At this point I also decided to set a limit to the amount of memory the programs could consume. For a 39 MB file I hoped that 10 times that would be enough, so I rounded up and set the limit at 512 MB.
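The arithmetic behind that limit can be sketched in a few lines. This assumes that “rounded up” means rounding to the next power of two, which is an interpretation rather than something stated above; next_power_of_two is a hypothetical helper written for this example.

```perl
use strict;
use warnings;

# Hypothetical helper: the smallest power of two that is >= $n.
sub next_power_of_two {
    my ($n) = @_;
    my $p = 1;
    $p *= 2 while $p < $n;
    return $p;
}

my $file_mb   = 39;                              # size of the XML file
my $budget_mb = $file_mb * 10;                   # ten times the file size: 390
my $limit_mb  = next_power_of_two($budget_mb);   # rounded up (assumed rule): 512

print "budget: $budget_mb MB, limit: $limit_mb MB\n";
# prints "budget: 390 MB, limit: 512 MB"
```

How the limit is then enforced is a separate choice. On a Unix-like system one option is the shell’s ulimit -v, which takes its value in kilobytes, though that is only a suggestion and not necessarily what was used here.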

But Perl, using the XML::LibXML module, couldn’t process the file with that memory limit. I also ran a quick one-liner in Erlang, just to watch it crash out of memory too. I’m going to try some other languages to see if I can find one that can work in 512 MB.

My next useful step is to try the XML::Twig module in Perl. I’ve had good experiences with it before. It won’t be as fast as LibXML, but it probably has the best chance of surviving within my 512 MB limit. For Haskell, I think I’ll have to resort to a SAX-style parser.

Class::Accessor can has “has”

I maintain the Class::Accessor module. It appears to be used a lot, but the API is a bit ugly. At YAPC::Asia the ugly API was criticised in at least three different talks, and each time it was compared to the fashionable Moose API.

In one of these talks JRockway asked Shawn Moore how to turn a bad API into a good API, so I’m going to try that: adding antlers to Class::Accessor!

So now instead of writing:

package Foo;
use base qw(Class::Accessor);
Foo->mk_accessors(qw(alpha beta gamma));

If you prefer Moose-style you can write:

package Foo;
use Class::Accessor "antlers";
has alpha => ( is => "rw" );
has beta  => ( is => "rw" );
has gamma => ( is => "rw" );

The original API is still available, and everything is the same underneath.
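As a usage sketch, here is how a class declared in the antlers style behaves from the outside. It assumes an installed Class::Accessor release that ships the “antlers” import; the values passed in are made up for the example.

```perl
use strict;
use warnings;

package Foo;
use Class::Accessor "antlers";
has alpha => ( is => "rw" );
has beta  => ( is => "rw" );
has gamma => ( is => "rw" );

package main;

# new() is inherited from Class::Accessor and takes a hash reference.
my $foo = Foo->new({ alpha => 1, gamma => 3 });

# Called with an argument an accessor sets the field, without one it reads it.
$foo->beta(2);

print join(",", $foo->alpha, $foo->beta, $foo->gamma), "\n";   # prints "1,2,3"
```

Foo ends up as a subclass of Class::Accessor either way, so code written against the mk_accessors style keeps working alongside it.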

It’s alive!

My blog: it’s alive!

I don’t post very often, but I’m going to try to change that. Is this my fifth attempt?

This time, to give myself a goal, I joined the Perl Ironman Challenge and I will try to blog at least once a week about Perl. So…

Perl: it’s alive!

There have been lots of reports over the last few years about Perl being dead. Those reports upset a lot of Perl mongers, and I didn’t fully understand that. Perl was not a family member, friend, or pet; so why the strong emotion? It was never really “alive”, so how did it “die”? And all these upset people were still using Perl, so they kept it breathing. And there were many more Perl users who weren’t upset, maybe because they never heard about the death.

It seems to me that Perl never died: it just became unfashionable for a while. And during the unfashionable period Perl did have some self-image issues, and maybe a lot of misdirected energy. But being unfashionable isn’t life-threatening.

YAPC::Asia 2008

They must really want to make it easy for us to attend YAPC::Asia this year: the venue is beside our apartment!
