Working with collocates

Now that Karen is expecting me to write more Perl scripts to analyse collocates, I think it’s time to install the Text::NSP module from CPAN.

Published
Categorized as Uncategorized Tagged

Perl Collocates

Karen and I were talking about linguistics and textual analysis, and how she wanted to analyse the writings of the Perl community. So, to make a start, we decided to write a short Perl script to extract word-level n-grams from some text so we could start looking for interesting collocates.


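use strict;
use warnings;

my $n = 4;      # the slice $i..$i+$n is inclusive, so each n-gram has $n+1 words
undef $/;       # slurp the whole input in one read
# lower-case the text, split on non-word characters, drop any empty leading token
my @txt = grep { length } split /\W+/, lc <>;
for my $i (0 .. $#txt - $n) {
    print "@txt[$i..$i+$n]\n";
}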

(N-grams of 5 words seem to be a good size for collocates, so we set $n=4: the slice $i..$i+$n is inclusive and therefore prints $n+1 words.)

To look for interesting collocates we simply piped the output of that script through sort | uniq -c | sort -n | tail. As test data I ran the script against versions 2 and 3 of the GPL. In version 2 the most common n-gram was “work based on the program”, but for version 3 it was “the gnu general public license”. That isn’t a particularly interesting result, but I’m sure we will find more interesting ones when we look at more than two source documents.
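The counting that the sort | uniq -c | sort -n | tail pipeline does could also be done inside Perl with a hash. What follows is only a rough sketch, not the script we actually ran: the sub names ngram_counts and top_ngrams and the sample sentence are made up for illustration, and here the size argument is the number of words in each n-gram rather than the $n offset used above.

```perl
use strict;
use warnings;

# Hypothetical helper: return a hash ref mapping every $size-word n-gram
# in $text to the number of times it occurs.
sub ngram_counts {
    my ($text, $size) = @_;
    # Lower-case, split on non-word characters, and drop the empty
    # leading token that split produces when the text starts with punctuation.
    my @words = grep { length } split /\W+/, lc $text;
    my %count;
    # The range is empty when there are fewer than $size words.
    for my $i (0 .. @words - $size) {
        $count{"@words[$i .. $i + $size - 1]"}++;
    }
    return \%count;
}

# Hypothetical helper: return up to $limit n-grams, most frequent first,
# with ties broken alphabetically so the order is repeatable.
sub top_ngrams {
    my ($count, $limit) = @_;
    my @sorted = sort { $count->{$b} <=> $count->{$a} || $a cmp $b } keys %$count;
    $#sorted = $limit - 1 if @sorted > $limit;
    return @sorted;
}

# Made-up sample text; a real run would slurp the documents instead.
my $sample = "The quick brown fox. The quick brown dog.";
my $count  = ngram_counts($sample, 3);
printf "%7d %s\n", $count->{$_}, $_ for top_ngrams($count, 10);
```

Unlike the pipeline, this lists the most frequent n-grams first. Keeping the counts in a hash also makes it easier to compare two documents later, at the cost of holding every distinct n-gram in memory.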


Shibuya Perl Mongers テクニカルトーク#8 (Technical Talk #8)

Karen and I went to the Shibuya.pm technical talk tonight. Most of the talks were in high-speed Japanese so we didn’t understand very much. But we need to start practising, and Perl talks are better than normal conversation because we can, at least, understand the Perl bits.

On the way home we were comparing the Tokyo tech talks to ones we have seen in Europe. There are a lot of similarities, but we noticed one trend: in Europe the focus is on how you can do something, but here it is on what you can do.


Planning a YAPC::Europe talk

Karen has already mentioned our joint YAPC::Europe talk called “My First CPAN Module”. We probably should rehearse it, or at least talk about it when I’m not watching South Park. I did read the outline she wrote, and I’m sure I could easily talk for at least an hour by following her plan.

We decided to do this talk together because we have different viewpoints. I’ve uploaded modules to CPAN, so the process makes sense to me. Karen was able to investigate the upload process from a first-timer’s perspective and so could spot the confusing parts. We hope that the combined perspective will make the talk useful.


Spork and chopsticks

At YAPC::Asia Ingy told us all about Sporx, explaining that it was a combination of Spork and Takahashi, and so should be pronounced “Sporkahashi”. When I began to tell Karen about “Sporkahashi” she said “That was clever” before I had done more than mention the name. Because she knew little about Spork and nothing about Takahashi she had assumed the “hashi” was 箸 instead of 橋.

Well, Karen wouldn’t have thought about the kanji characters, but she knew that “hashi” (箸) meant “chopsticks”, so she thought a “spork and chopsticks” name was a smart idea from Ingy.

I don’t think anyone else spotted that.  The “hashi” (橋) in Takahashi (高橋) means “bridge”; 高橋 is a surname that means “high bridge”.

YAPC::Asia in Tokyo

I gave two talks today at YAPC::Asia in Tokyo. Surprisingly I finished both talks in a lot less time than planned; usually I need to rush at the end to stay on schedule. I really should work out why these talks finished quickly when I spoke more slowly.

One of the talks was 混合語 (“Kongougo”) (yes, the title will look strange if you don’t have a Japanese font installed). When I gave this talk in Europe I spent some time explaining Japanese to the Europeans, and I obviously didn’t need to do that in Japan. So instead I rewrote the slides to use more Japanese. It seemed to work: the audience laughed a lot, which is really the only important thing.


Round them up, put ’em in a field, …

4 hours is enough time to watch a good film and have a good meal. It’s also just enough time to find a very annoying bug.

My script wasn’t inserting all its stuff into the database, but it was inserting part of it so I knew it wasn’t a database connectivity problem; it was also reaching the end without complaint, so I knew it wasn’t dying somewhere. It thought it was doing the right thing.

The code looked correct, and I had similar scripts that worked, so I started cutting bits off (the code). Eventually I had two scripts, one working and one failing, with the following diff:

-my ($username) = "marty" or exit 65;
+my ($username) = ($rcpt =~ /(\w+)\@$MX/) or exit 65;

Now I knew I would be here for quite a while: someone somewhere in the code was using one of those irritating regex variables without checking if their match had actually worked! AARGH!

The offending code looked something like this (I’ve simplified it so it won’t distract from the bug):

    $text =~ m/:(\w+);|==(\w+)==/;
    return $+;

So, when $text was "foobar", the match failed and $+ still held the last bracketed part of an earlier, unrelated match, because Perl only updates the capture variables when a match succeeds.

The code should have been something like:

    $text =~ m/:(\w+);|==(\w+)==/  or return;
    return $+;
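To see the difference between the two versions in isolation, here is a small sketch. The sub names extract_unchecked and extract_checked and the example address are made up for illustration; the real code was part of something larger.

```perl
use strict;
use warnings;

# Hypothetical reproduction of the bug: $+ is returned whether or not
# the match succeeded.
sub extract_unchecked {
    my ($text) = @_;
    $text =~ m/:(\w+);|==(\w+)==/;
    return $+;
}

# The corrected version: give up early if the match fails, so $+ is
# only read after a successful match.
sub extract_checked {
    my ($text) = @_;
    $text =~ m/:(\w+);|==(\w+)==/ or return;
    return $+;
}

# An unrelated successful match in the caller, as in the failing script.
'marty@example.org' =~ /(\w+)\@/;

my $stale = extract_unchecked("foobar");   # match fails, stale $+ leaks through
my $safe  = extract_checked("foobar");     # match fails, so undef comes back

print "unchecked: ", (defined $stale ? "'$stale'" : "undef"), "\n";
print "checked:   ", (defined $safe  ? "'$safe'"  : "undef"), "\n";
```

The capture variables are dynamically scoped, so a sub whose own match fails still sees the caller's last successful match. That is why swapping the literal "marty" for a regex match was enough to change the behaviour.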

I do try to be thankful in all circumstances: I’m glad I was using Free Software that I can debug and patch.
