If you are a course participant, please send the solutions to these exercises (both programs and program output) to e.f.tjong.kim.sang(at)rug.nl by Tuesday 12 May 2009 23:59 PM or put them in the digital dropbox of Nestor before that date.
This week there are four exercises, each of which is worth 10
points. A total of 30 points corresponds with grade 10
8 while 40 points corresponds with a 10
(0 points corresponded with grade 1; assignment 5.4 has been dropped).
Start all your Perl programs with the line use strict; and have my before all the first usages of variables.
A tokenizer is a program that splits text in tokens (words and punctuation signs). For example, the sentence "It's possible," said Alan Thirkettle who runs Europe's interests in the International Space Station (ISS). from the BBC would be converted to:
" It 's possible , " said Alan Thirkettle who runs Europe 's interests in the International Space Station ( ISS ) .
Here extra spaces have been added to locations of token boundaries. By convention, 's has been regarded as a single token.
Write a Perl program that reads a Dutch text, divides it in tokens (words and punctuation signs), and counts the total number of tokens as well as the number of occurrances of each separate token type (example template). The program should print the type frequencies in decending order. Apply the program to the file /home/erikt/class/cl09/cl09text.txt (also available in cl09text.zip). What results do you get?
Example output:
$ perl -w ex51.pl < /home/erikt/class/cl09/cl09text.txt | head Found: 12068422 tokens 549207 de 519323 . 456180 , 358202 van 309458 * 240753 en 225527 het 213383 een 212717 in
Hint: much of the code of exercise 3.4 can be
reused for this exercise (see answers for
exercise 3).
Note: the exact token counts and type counts depend on
the implementation of the program. You do not need to obtain exactly
these numbers or these type orders.
A bigram is a sequence of two tokens. For example, the sentence This is a test . contains four bigrams: 1. This is; 2. is a; 3. a test; and 4. test .; Modify the program from exercise 5.2 and make it output bigram counts rather than token type counts. Try to avoid having tokens of different sentences in a single bigram. Apply the program to the file /home/erikt/class/cl09/cl09text.txt . What results do you get?
Example output:
$ perl -w ex52.pl < /home/erikt/class/cl09/cl09text.txt | head Found: 11549099 bigrams 105377 van de 54223 ) * 49431 in de 40590 ) , 38628 * * 34574 van het 25970 in het 22392 ) . 21265 en de
Note: the exact bigram counts depend on the implementation of the program. You do not need to obtain exactly these bigram counts or bigram orders.
Mutual information is a statistical measure for computing the dependence between two variables. We will use it for estimating how strong the relation is between two successive tokens t1 and t2 in a text. For this purpose, the following formula can be used:
MI(t1,t2) = 2log(bigramF(t1t2) / (typeF(t1) * typeF(t2)))
Here, bigramF(t1t2) is the bigram frequency of the pair of tokens t1t2 in the text while typeF(t1) is the frequency of token type t1. The 2log function is not available in Perl but it can be simulated with the standard log function: 2log(x) = log(x)/log(2).
Change the program of exercise 5.2 to a program that computes mutual information scores. All the necessary parts can be obtained from the earlier programs: typeF(t1) scores from exercise 5.1 (in combination with the token count) and bigramF(t1t2) from exercise 5.2 (in combination with the bigram count). The only remaining task is to combine these to mutual information scores for word bigrams.
Apply the program to the file /home/erikt/class/cl09/cl09text.txt . What results do you get?
Example output:
$ perl -w ex53.pl < /home/erikt/class/cl09/cl09text.txt | head Found: 12068422 tokens and 11549099 bigrams 23.6 Rodenrijse Droogmakerij 23.6 temminckii.jpg image:Crossarchus 23.6 Viral mediated 23.6 Image:Vlieland Sea.JPG 23.6 Sveta Nedeliakathedraal 23.6 Filene Shouse 23.6 oppervlakte=Asfalt baan2=01-19 23.6 mortibus persecutorum 23.6 contra-expertiseVertaling Contra-Expertise
Note: the exact mutual information scores depend on the implementation of the program. You do not need to obtain exactly these scores, counts or orders.
This is an optional exercise. Make this exercise only if you think it is interesting and if you have some time left.
We would like to use mutual information for obtaining interesting Dutch word pairs (collocations). However, the top of the result list obtained by the program developed in the previous exercise, is not interesting at all. It probably only contains infrequent words (with frequency 1) which happen to appear next to each other in the text.
In this exercise, your task is to try to think of and implement methods for cleaning up the output obtained in the previous exercise. Would these results have been better if the words of frequency 1 had been omitted? Or all words with a frequency less than 10, 25 or 50? Can you think of other approaches to obtaining more interesting pairs in the top of the list?
For this exercise, you should submit a description of your ideas as well as the results of some experiments involving your ideas and the Perl programs developed for these experiments.