Now that Marty has decided to write a simple Perl script to pull collocates out of the data for me, I need to give him a more precise specification of a collocate. Carmen Dayrell wrote a paper, “A quantitative approach to compare collocation patterns in translated and non-translated texts”, which contains a detailed section on how to decide what a collocate is.
The first step is to work out which words should be taken as nodes, but as I am interested in specific nodes, like the word “Perl”, I will not be doing this. Then we need to decide how we will define a collocate. Dayrell suggests that a collocation should occur at least 4 times to count as significant, within a span of up to 4 words on either side of the node. Structural boundaries in the text should also be ignored.
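As a rough illustration of that specification, here is a minimal sketch in Python (not Marty's Perl script). The function name `find_collocates` and the parameters `span` and `min_freq` are made up for this example. It assumes a naive tokeniser that lowercases the text and drops punctuation, and it reads "ignoring structural boundaries" as letting the window run across sentence breaks; both are assumptions rather than anything taken from Dayrell's paper.

```python
import re
from collections import Counter


def find_collocates(text, node, span=4, min_freq=4):
    """Count words within `span` tokens of `node`, keeping those seen at least `min_freq` times."""
    # Naive tokenisation (an assumption): lowercase, keep runs of letters,
    # digits and apostrophes. Punctuation is dropped, so sentence boundaries
    # do not stop the window.
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    node = node.lower()

    counts = Counter()
    for i, token in enumerate(tokens):
        if token != node:
            continue
        # Up to `span` words on either side of this occurrence of the node.
        left = tokens[max(0, i - span):i]
        right = tokens[i + 1:i + 1 + span]
        counts.update(left + right)

    # Apply the minimum-frequency threshold.
    return {word: n for word, n in counts.items() if n >= min_freq}
```

Calling `find_collocates(corpus_text, "Perl")` would use the 4-word span and the threshold of 4 occurrences described above. Note that this sketch counts raw co-occurrences only and does not test statistical significance.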
While Marty does this, I am going to read Church and Hanks's work on word association norms and mutual information to see if any of it will help me get better results.