OpenOffice
6 languages, 15 bitexts
total number of files: 10983
total number of tokens: 2612156
total number of sentence fragments: 246760
The original documentation of the office package
OpenOffice.org (http://www.openoffice.org/)
contains 2014 English documents which have been partly
translated into 5 languages: French, Spanish, Swedish, German,
and Japanese. The original documentation in English comprises
about 500,000 words and translations contain between 400,000
and 500,000 words per language. All documents have been
tokenized and, except for the Spanish part, tagged with parts of
speech. The English part of the corpus has been marked with
syntactic chunks as well.
Please cite the following article if you use any part of the corpus
in your own work:
Jörg Tiedemann, Lars Nygaard, 2004,
The OPUS corpus - parallel & free.
In Proceedings of the Fourth International Conference on Language
Resources and Evaluation (LREC'04). Lisbon, Portugal
Download
Upper-right triangle: sample files (test = sentence alignment samples, language IDs = XML file samples)
Bottom-left triangle: XML files (ces = sentence alignment files in XCES format, language IDs = gzipped tar archives of corpus files in XML)
Statistics
Number of files, tokens, and sentence fragments per language (columns: files, tokens, sentences)
Number of aligned sentences per target language (columns: de, en, es, fr, jp, sv)

| language | files | tokens | sentences | de | en | es | fr | jp | sv |
|---|---|---|---|---|---|---|---|---|---|
| de | 2014 | 474436 | 47482 | | 42903 | 37764 | 37085 | 31107 | 37947 |
| en | 2014 | 478654 | 44961 | 42903 | | 38583 | 38014 | 33143 | 38906 |
| es | 1738 | 491426 | 40009 | 37764 | 38583 | | 38477 | 33445 | 39479 |
| fr | 1739 | 496780 | 39462 | 37085 | 38014 | 38477 | | 33295 | 38726 |
| jp | 1739 | 267665 | 34167 | 31107 | 33143 | 33445 | 33295 | | 34026 |
| sv | 1739 | 403195 | 40679 | 37947 | 38906 | 39479 | 38726 | 34026 | |
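
The per-language figures in the table can be checked against the corpus totals given at the top of the page (files, tokens, sentence fragments, and number of bitexts). The following is a minimal Python sketch of that arithmetic. The numbers are copied by hand from the table, and the names `per_language`, `aligned`, `corpus_totals`, and `count_bitexts` are illustrative only; they are not part of the corpus distribution.

```python
from itertools import combinations

# Per-language counts copied from the statistics table:
# language ID -> (files, tokens, sentence fragments)
per_language = {
    "de": (2014, 474436, 47482),
    "en": (2014, 478654, 44961),
    "es": (1738, 491426, 40009),
    "fr": (1739, 496780, 39462),
    "jp": (1739, 267665, 34167),
    "sv": (1739, 403195, 40679),
}

# Aligned sentence counts per language pair (one triangle of the
# symmetric matrix in the table), keyed by the two language IDs.
aligned = {
    ("de", "en"): 42903, ("de", "es"): 37764, ("de", "fr"): 37085,
    ("de", "jp"): 31107, ("de", "sv"): 37947, ("en", "es"): 38583,
    ("en", "fr"): 38014, ("en", "jp"): 33143, ("en", "sv"): 38906,
    ("es", "fr"): 38477, ("es", "jp"): 33445, ("es", "sv"): 39479,
    ("fr", "jp"): 33295, ("fr", "sv"): 38726, ("jp", "sv"): 34026,
}


def corpus_totals(stats):
    """Sum files, tokens, and sentence fragments over all languages."""
    files = sum(row[0] for row in stats.values())
    tokens = sum(row[1] for row in stats.values())
    sentences = sum(row[2] for row in stats.values())
    return files, tokens, sentences


def count_bitexts(languages):
    """Number of unordered language pairs, i.e. possible bitexts."""
    return len(list(combinations(sorted(languages), 2)))


if __name__ == "__main__":
    files, tokens, sentences = corpus_totals(per_language)
    print("files:", files)
    print("tokens:", tokens)
    print("sentence fragments:", sentences)
    print("bitexts:", count_bitexts(per_language))
```

Each of the 15 unordered language pairs has exactly one entry in `aligned`, which matches the bitext count stated in the header.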