Previous | Home | Next

 

CL Perl 2009: Exercise 5


These exercises are part of a programming course taught at University of Groningen, The Netherlands.

If you are a course participant, please send the solutions to these exercises (both programs and program output) to e.f.tjong.kim.sang(at)rug.nl by Tuesday 12 May 2009 23:59 PM or put them in the digital dropbox of Nestor before that date.

This week there are four exercises, each of which is worth 10 points. A total of 30 points corresponds with grade 10 8 while 40 points corresponds with a 10 (0 points corresponded with grade 1; assignment 5.4 has been dropped).

Start all your Perl programs with the line use strict; and have my before all the first usages of variables.

Exercise 5.1

A tokenizer is a program that splits text in tokens (words and punctuation signs). For example, the sentence "It's possible," said Alan Thirkettle who runs Europe's interests in the International Space Station (ISS). from the BBC would be converted to:

" It 's possible , " said Alan Thirkettle who runs Europe 's interests in the International Space Station ( ISS ) .

Here extra spaces have been added to locations of token boundaries. By convention, 's has been regarded as a single token.

Write a Perl program that reads a Dutch text, divides it in tokens (words and punctuation signs), and counts the total number of tokens as well as the number of occurrances of each separate token type (example template). The program should print the type frequencies in decending order. Apply the program to the file /home/erikt/class/cl09/cl09text.txt (also available in cl09text.zip). What results do you get?

Example output:

$ perl -w ex51.pl < /home/erikt/class/cl09/cl09text.txt | head
Found: 12068422 tokens
549207 de
519323 .
456180 ,
358202 van
309458 *
240753 en
225527 het
213383 een
212717 in

Hint: much of the code of exercise 3.4 can be reused for this exercise (see answers for exercise 3).
Note: the exact token counts and type counts depend on the implementation of the program. You do not need to obtain exactly these numbers or these type orders.

Exercise 5.2

A bigram is a sequence of two tokens. For example, the sentence This is a test . contains four bigrams: 1. This is; 2. is a; 3. a test; and 4. test .; Modify the program from exercise 5.2 and make it output bigram counts rather than token type counts. Try to avoid having tokens of different sentences in a single bigram. Apply the program to the file /home/erikt/class/cl09/cl09text.txt . What results do you get?

Example output:

$ perl -w ex52.pl < /home/erikt/class/cl09/cl09text.txt | head
Found: 11549099 bigrams
105377 van de
 54223 ) *
 49431 in de
 40590 ) ,
 38628 * *
 34574 van het
 25970 in het
 22392 ) .
 21265 en de

Note: the exact bigram counts depend on the implementation of the program. You do not need to obtain exactly these bigram counts or bigram orders.

Exercise 5.3

Mutual information is a statistical measure for computing the dependence between two variables. We will use it for estimating how strong the relation is between two successive tokens t1 and t2 in a text. For this purpose, the following formula can be used:

MI(t1,t2) = 2log(bigramF(t1t2) / (typeF(t1) * typeF(t2)))

Here, bigramF(t1t2) is the bigram frequency of the pair of tokens t1t2 in the text while typeF(t1) is the frequency of token type t1. The 2log function is not available in Perl but it can be simulated with the standard log function: 2log(x) = log(x)/log(2).

Change the program of exercise 5.2 to a program that computes mutual information scores. All the necessary parts can be obtained from the earlier programs: typeF(t1) scores from exercise 5.1 (in combination with the token count) and bigramF(t1t2) from exercise 5.2 (in combination with the bigram count). The only remaining task is to combine these to mutual information scores for word bigrams.

Apply the program to the file /home/erikt/class/cl09/cl09text.txt . What results do you get?

Example output:

$ perl -w ex53.pl < /home/erikt/class/cl09/cl09text.txt | head
Found: 12068422 tokens and 11549099 bigrams
23.6 Rodenrijse Droogmakerij
23.6 temminckii.jpg image:Crossarchus
23.6 Viral mediated
23.6 Image:Vlieland Sea.JPG
23.6 Sveta Nedeliakathedraal
23.6 Filene Shouse
23.6 oppervlakte=Asfalt baan2=01-19
23.6 mortibus persecutorum
23.6 contra-expertiseVertaling Contra-Expertise

Note: the exact mutual information scores depend on the implementation of the program. You do not need to obtain exactly these scores, counts or orders.

Exercise 5.4*

This is an optional exercise. Make this exercise only if you think it is interesting and if you have some time left.

We would like to use mutual information for obtaining interesting Dutch word pairs (collocations). However, the top of the result list obtained by the program developed in the previous exercise, is not interesting at all. It probably only contains infrequent words (with frequency 1) which happen to appear next to each other in the text.

In this exercise, your task is to try to think of and implement methods for cleaning up the output obtained in the previous exercise. Would these results have been better if the words of frequency 1 had been omitted? Or all words with a frequency less than 10, 25 or 50? Can you think of other approaches to obtaining more interesting pairs in the top of the list?

For this exercise, you should submit a description of your ideas as well as the results of some experiments involving your ideas and the Perl programs developed for these experiments.


Previous | Home | Answers | Next
Last update: June 10, 2009. e.f.tjong.kim.sang(at)rug.nl