I was asked to “Free the code” from my XML parsing experiment, so I will post some of it here. It may be a bit disappointing though, since these are only short scripts, and they’re a bit ugly. I’ll explain the Perl one today, and do the Haskell one sometime soon.
I was playing with Jim Breen’s Japanese dictionary and I wanted to make a list of the first kanji component in each entry. I wanted exactly one result per entry, so I used “(none)” if the entry had no kanji part. This is not a difficult problem, although XML makes it as slow and memory-intensive as many difficult problems.
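use strict;
use warnings;
use XML::Twig;

my @keb = (); # for the results

# Called once for each complete <entry> element.
sub entry {
    my ($t, $e) = @_;
    my $kt = "(none)";
    # Take the first <keb> inside the first <k_ele>, if there is one.
    if (my $k = $e->first_child("k_ele")) {
        if (my $keb = $k->first_child("keb")) {
            $kt = $keb->text();
        }
    }
    $e->purge; # finished with this entry, so let its memory go
    push @keb, $kt;
}

my $twig = XML::Twig->new(
    twig_handlers => { entry => \&entry }
);
$twig->parsefile($ARGV[0]);
$twig->purge;
# now the results are in @keb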
Using XML::Twig is quite simple. When I create the parser I tell it how to handle the elements I care about, and in this case I only care about “entry” elements. When the parser finds an entry, it calls my entry subroutine, passing the entry’s object as the second parameter, $e. Inside the entry routine I can use DOM-style methods on $e to extract the data I want. Notice that I call $e->purge when I’ve got the data out. This tells the parser that I won’t need that element again, so it can free the memory. This is how XML::Twig manages to parse a file that most other modules can’t.
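If you want to watch the handler fire without downloading the whole dictionary, here is a minimal sketch that runs the same logic over a tiny inline document. The sample XML is made up for illustration: it only imitates the entry / k_ele / keb layout, and K1, K2, R1 and R2 are ASCII placeholders rather than real dictionary text. It also uses parse on a string where the script above uses parsefile.

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical miniature input; the real dictionary file is far larger.
my $xml = <<'XML';
<JMdict>
  <entry>
    <k_ele><keb>K1</keb></k_ele>
    <k_ele><keb>K2</keb></k_ele>
    <r_ele><reb>R1</reb></r_ele>
  </entry>
  <entry>
    <r_ele><reb>R2</reb></r_ele>
  </entry>
</JMdict>
XML

my @keb; # one result per entry

sub entry {
    my ($t, $e) = @_;
    # First <keb> of the first <k_ele>, or undef if the entry has no kanji part.
    my $k   = $e->first_child('k_ele');
    my $keb = $k ? $k->first_child('keb') : undef;
    push @keb, $keb ? $keb->text : '(none)';
    $e->purge; # this entry is no longer needed
}

my $twig = XML::Twig->new(twig_handlers => { entry => \&entry });
$twig->parse($xml); # parse a string instead of a file

print "$_\n" for @keb;
```

The first entry has two kanji elements but only the first one is kept, and the second entry has none, so the script should print K1 followed by (none).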