NLPII 2009: Exercise 7


This exercise is part of the course Natuurlijke taalverwerking II taught at the University of Groningen, The Netherlands.

Exercise: Information Extraction

If you are a course participant then please send the solutions to this exercise by e-mail to e.f.tjong.kim.sang(at)rug.nl before Tuesday 23 June 2009, 23:59.

Send in the code of your program and answers to all of the questions mentioned below.

Introduction

In this exercise, your task is to write a computer program that extracts interesting information from Dutch text. You are advised not to use the FSA toolkit for this task because that would involve a lot of work. Instead, you should write a program in a language of your own choice (for example, Perl, Java, C++ or Prolog).

Background information on information extraction can be found in the lecture slides (pdf) and in the articles cited in those slides.

Task

Choose two different information units from the following list. The two units should be of different kinds: for example, do not choose the birth days and the death days of people, but rather something about people and something about locations. Then write two programs that extract facts for these information units from running Dutch text (not from tables) in the file /home/erikt/class/n209/nlwiki.txt (or download the file via nlwiki.zip). Store the facts found by each program in a file and, for each output file, answer the questions specified at the end of this exercise.

Information unit list:

What does the data look like?

You will use data taken from the Dutch Wikipedia (about 6% of the encyclopedia). The data has been converted to text, tokenized and each word has been assigned a named entity tag. Five tags have been used: PER (persons), LOC (locations), ORG (organizations), MISC (other named entities such as country adjectives and boat names) and O (words that do not belong to a named entity). Here is an example of the data:

   -KEY-/O 
   Alpen/LOC 
   -DESCRIPTION-/O 
   de/O Alpen/LOC (/O van/O Latijns/LOC Alpes/LOC ,/O van/O de/O 
   stam/O alb-/O =/O wit/O )/O zijn/O een/O bergketen/O in/O Europa/LOC 
   ,/O die/O zich/O uitstrekt/O van/O de/O Franse/MISC Middellandse/MISC 
   Zeekust/MISC in/O het/O zuidwesten/O tot/O het/O Pannonisch/ORG 
   Bekken/ORG in/O het/O oosten/O ./O 

Each Wikipedia article starts with the word -KEY- followed by the title of the article. Next is the word -DESCRIPTION- followed by the text of the article. The text is tokenized, which means that punctuation signs have been separated from the words and that each sentence appears on a separate line. Named entity tags have been added to the words. Consecutive words with the same named entity tag (other than O) are assumed to belong to the same entity. Note that the tags have been assigned by a computer program and contain errors (here, for example, Pannonisch Bekken should have been tagged LOC).
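To make the format concrete, here is a minimal Python sketch of how the data could be read. It is not the course's own code: the names read_articles and group_entities are made up for illustration, and it assumes that the -KEY- and -DESCRIPTION- markers and the title each appear on a line of their own, and that every sentence occupies one line in the file.

```python
import itertools
from typing import Iterable, Iterator, List, Tuple

Token = Tuple[str, str]  # (word, named entity tag)


def parse_token(item: str) -> Token:
    """Split 'word/TAG' on the last slash, so a word may itself contain '/'."""
    word, _, tag = item.rpartition("/")
    return word, tag


def read_articles(lines: Iterable[str]) -> Iterator[Tuple[str, List[List[Token]]]]:
    """Yield (title, sentences) pairs; each sentence is a list of tokens.

    Assumption: marker lines start with -KEY- or -DESCRIPTION- and the
    title follows -KEY- on the next non-empty line.
    """
    title, sentences, state = None, [], None
    for line in lines:
        tokens = [parse_token(item) for item in line.split()]
        if not tokens:
            continue  # skip empty lines
        first_word = tokens[0][0]
        if first_word == "-KEY-":
            if title is not None:
                yield title, sentences  # previous article is complete
            title, sentences, state = None, [], "key"
        elif first_word == "-DESCRIPTION-":
            state = "text"
        elif state == "key":
            title = " ".join(word for word, _ in tokens)
        elif state == "text":
            sentences.append(tokens)
    if title is not None:
        yield title, sentences


def group_entities(tokens: List[Token]) -> List[Tuple[str, str]]:
    """Merge consecutive tokens with the same non-O tag into (phrase, tag)."""
    entities = []
    for tag, run in itertools.groupby(tokens, key=lambda token: token[1]):
        if tag != "O":
            entities.append((" ".join(word for word, _ in run), tag))
    return entities
```

Applied to the example sentence above (joined into one line), group_entities would return Alpen, Latijns Alpes and Europa as LOC, Franse Middellandse Zeekust as MISC and Pannonisch Bekken as ORG, including the tagging errors present in the data.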

Software

For this exercise, you need to write your own information extraction program in your favorite programming language. There is an example Perl program available to show how you can implement the extraction rules. The program searches for the birth years of people: at every word in the text it checks whether the rule tag=PER+ word=( word=1\d\d\d matches, that is, one or more words tagged PER, followed by an opening parenthesis, followed by a four-digit number that starts with 1. If the program finds a match then it prints the derived birth year followed by the name, separated by a hash. Example: 1961#Wynton Marsalis
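For those not using Perl, the same rule could be sketched in Python as follows. This is an illustration under assumptions, not a translation of the example program: the function name extract_birth_years is made up, the sketch takes the longest run of PER words as the name, and it requires the year token to consist of exactly four digits.

```python
import re
from typing import List, Tuple

Token = Tuple[str, str]  # (word, named entity tag)

# word=1\d\d\d : a four-digit number starting with 1 (assumed to be the whole token)
YEAR = re.compile(r"1\d\d\d")


def extract_birth_years(tokens: List[Token]) -> List[str]:
    """Apply the rule tag=PER+ word=( word=1\\d\\d\\d to one sentence.

    Returns facts in the form 'year#name'.
    """
    facts = []
    i = 0
    while i < len(tokens):
        # tag=PER+ : find the end of the run of PER tokens starting at i.
        j = i
        while j < len(tokens) and tokens[j][1] == "PER":
            j += 1
        if j == i:
            i += 1  # no PER token here, move on
            continue
        # word=( followed by word=1\d\d\d
        if (j + 1 < len(tokens)
                and tokens[j][0] == "("
                and YEAR.fullmatch(tokens[j + 1][0])):
            name = " ".join(word for word, _ in tokens[i:j])
            facts.append(f"{tokens[j + 1][0]}#{name}")
        i = j  # continue after the PER run
    return facts
```

A rule this strict only fires when the year directly follows the parenthesis, so sentences that mention a birthplace or a full date before the year are missed. That trades coverage for precision and is a natural starting point for the improvements asked about in the questions.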

List of resources for this exercise

Questions

For each of the output files of your two programs, answer the following questions:

  1. How many different information units did you derive?
    You may assume that two information units are different if they contain different characters. If your output file contains one information unit per line then use sort -u file|wc -l to find out the number of different information units.
  2. What is the accuracy of your extraction program for the information types that you have derived? That is: what percentage of the extracted information units is correct? Estimate this percentage separately for each of the information types that you derived, for example person-birthday and company-founder.
    You can estimate these percentages by randomly selecting 20 of the information units and manually checking if they are correct, for example by looking up the information in Wikipedia. Use the script ransort (save as ransort) for selecting 20 random facts:
    perl -w ransort < file | head -20
  3. Give examples of three different types of errors made by your program and explain why the errors occurred.
  4. Give a few examples of what could be done to improve the program so that it would be able to derive more facts with greater accuracy.
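
The counting and sampling steps for questions 1 and 2 can also be done inside a program. The sketch below is one possible Python version and is not the ransort script: the function names are made up for illustration, blank lines are skipped (unlike sort -u), and the sample is drawn from the distinct units rather than from the raw file.

```python
import random
from typing import Iterable, List, Optional


def distinct_units(lines: Iterable[str]) -> List[str]:
    """Return the sorted distinct non-blank lines (one information unit each)."""
    return sorted({line.rstrip("\n") for line in lines if line.strip()})


def sample_units(units: List[str], k: int = 20,
                 seed: Optional[int] = None) -> List[str]:
    """Pick up to k units at random, without replacement, for manual checking."""
    rng = random.Random(seed)
    return rng.sample(units, min(k, len(units)))


def estimated_accuracy(judgements: List[bool]) -> float:
    """Percentage of manually checked units that were judged correct."""
    if not judgements:
        raise ValueError("no judgements given")
    return 100.0 * sum(judgements) / len(judgements)
```

For example, if 17 of the 20 sampled facts turn out to be correct, estimated_accuracy returns 85.0. With a sample of only 20 facts this remains a rough estimate.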

Send your programs as well as the answers to these questions to e.f.tjong.kim.sang(at)rug.nl before Tuesday 23 June 2009, 23:59.


Last update: June 24, 2009. e.f.tjong.kim.sang(at)rug.nl