If you are a course participant then please send the solutions to this exercise by e-mail to e.f.tjong.kim.sang(at)rug.nl before Tuesday 23 June 2009, 23:59.
Send in the code of your program and answers to all of the questions mentioned below.
In this exercise, your task is to write a computer program that extracts interesting information from Dutch text. You are advised not to use the FSA toolkit for this task because that would involve a lot of work. Instead, you should write a program in a language of your own choice (for example, Perl, Java, C++ or Prolog).
Background information on information extraction can be found in the lecture slides (pdf) and in the articles cited in these slides.
Choose two different information units from the following list. The two units should be of different kinds: do not choose, for example, both the birth dates and the death dates of people, but rather something about people and something about locations. Then write two programs that extract facts for these information units from running Dutch text (so not from tables) in the file /home/erikt/class/n209/nlwiki.txt (or download the file via nlwiki.zip). Store the facts extracted by each program in a separate file, and for each output file answer the questions specified at the end of this exercise.
Information unit list:
You will use data taken from the Dutch Wikipedia (about 6% of the encyclopedia). The data has been converted to text, tokenized and each word has been assigned a named entity tag. Five tags have been used: PER (persons), LOC (locations), ORG (organizations), MISC (other named entities such as country adjectives and boat names) and O (words that do not belong to a named entity). Here is an example of the data:
-KEY-/O Alpen/LOC -DESCRIPTION-/O de/O Alpen/LOC (/O van/O Latijns/LOC Alpes/LOC ,/O van/O de/O stam/O alb-/O =/O wit/O )/O zijn/O een/O bergketen/O in/O Europa/LOC ,/O die/O zich/O uitstrekt/O van/O de/O Franse/MISC Middellandse/MISC Zeekust/MISC in/O het/O zuidwesten/O tot/O het/O Pannonisch/ORG Bekken/ORG in/O het/O oosten/O ./O
Each Wikipedia article starts with the word -KEY- followed by the title of the article. Next is the word -DESCRIPTION- followed by the text of the article. The text is tokenized, which means that punctuation marks have been separated from the words and that each sentence appears on a separate line. Named entity tags have been added to the words in the form word/TAG. Two consecutive words with the same named entity tag (other than O) are assumed to belong to the same entity. Note that the tags have been assigned by a computer program and contain errors (here, for example, Pannonisch Bekken should have been tagged LOC rather than ORG).
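The word/TAG format described above can be read with a few lines of code. The following is a minimal sketch in Python (not one of the languages suggested above, and not part of the course material); the function names parse_line and group_entities are made up for this illustration. It assumes that tokens are separated by whitespace and that the tag follows the last slash of each token.

```python
from itertools import groupby


def parse_line(line):
    """Split a tokenized line into (word, tag) pairs."""
    pairs = []
    for token in line.split():
        # Split on the LAST slash so that words containing "/" stay intact.
        word, sep, tag = token.rpartition("/")
        if not sep:
            continue  # assumption: tokens without a tag are skipped
        pairs.append((word, tag))
    return pairs


def group_entities(pairs):
    """Merge consecutive words that share a non-O tag into one entity."""
    entities = []
    for tag, group in groupby(pairs, key=lambda pair: pair[1]):
        if tag != "O":
            entities.append((" ".join(word for word, _ in group), tag))
    return entities


line = "in/O het/O zuidwesten/O tot/O het/O Pannonisch/ORG Bekken/ORG in/O het/O oosten/O ./O"
print(group_entities(parse_line(line)))
# prints [('Pannonisch Bekken', 'ORG')]
```

Grouping by tag alone follows the rule stated above literally, so two different entities with the same tag that happen to be adjacent would be merged into one; whether that matters depends on the information unit you choose.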
For this exercise, you need to write your own information extraction program in your favorite programming language. There is an example Perl program available to show how you can implement the extraction rules. The program searches for people's birth years: it checks every word in the text to see if a match of the following rule starts there: tag=PER+ word=( word=1\d\d\d (one or more words tagged PER, followed by an opening parenthesis, followed by a four-digit number starting with 1). If the program finds a match then it prints the derived birth year followed by the name, separated by a hash sign (#). Example: 1961#Wynton Marsalis
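To make the rule concrete, here is a minimal sketch of how it could be applied to one sentence. It is written in Python and is not the example Perl program; the function name extract_birth_years and the test sentence are invented for this illustration. It assumes that the year pattern must match the whole word and that the longest run of PER words is taken as the name.

```python
import re

# word=1\d\d\d : a four-digit number starting with 1 (assumed to be the whole word)
YEAR = re.compile(r"1\d\d\d")


def extract_birth_years(line):
    """Return 'year#name' facts matching tag=PER+ word=( word=1\\d\\d\\d."""
    # Each token has the form word/TAG; keep the (word, tag) parts.
    tokens = [tok.rpartition("/")[::2] for tok in line.split() if "/" in tok]
    facts = []
    i = 0
    while i < len(tokens):
        # tag=PER+ : collect one or more consecutive words tagged PER.
        j = i
        while j < len(tokens) and tokens[j][1] == "PER":
            j += 1
        if j > i:
            # word=( word=1\d\d\d : an opening parenthesis, then the year.
            if (j + 1 < len(tokens)
                    and tokens[j][0] == "("
                    and YEAR.fullmatch(tokens[j + 1][0])):
                name = " ".join(word for word, _ in tokens[i:j])
                facts.append(f"{tokens[j + 1][0]}#{name}")
            i = j  # continue after the name
        else:
            i += 1
    return facts


# Invented sentence in the corpus format, for illustration only.
sentence = "Wynton/PER Marsalis/PER (/O 1961/O )/O is/O een/O trompettist/O ./O"
print(extract_birth_years(sentence))
# prints ['1961#Wynton Marsalis']
```

A rule this strict only fires when the year directly follows the parenthesis, so it will miss sentences in which a place or a full date comes first. Loosening the rule finds more facts but also more incorrect ones, which is the trade-off to keep in mind when designing your own rules.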
For each of the output files of your two programs, answer the following questions:
Send your programs as well as the answers to these questions to e.f.tjong.kim.sang(at)rug.nl before Tuesday 23 June 2009, 23:59.