Europarl3
source: http://www.statmt.org/europarl/
(release 3)
11 languages, 55 bitexts
total number of files: 7038
total number of tokens: 404290831
total number of sentence fragments: 15818708
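The bitext count follows from the language count: 11 languages give 11 * 10 / 2 = 55 unordered language pairs. A minimal Python sketch of that arithmetic follows. The language codes are an assumption (they are not listed in the text above), and only their number matters here.

```python
from itertools import combinations
from math import comb

# Assumed language codes for illustration only; the page states just "11 languages".
LANGUAGES = ["da", "de", "el", "en", "es", "fi", "fr", "it", "nl", "pt", "sv"]


def language_pairs(languages):
    """Return every unordered pair of distinct languages, one per bitext."""
    return list(combinations(sorted(languages), 2))


pairs = language_pairs(LANGUAGES)

# The number of pairs equals "n choose 2".
print(len(LANGUAGES), "languages ->", len(pairs), "bitexts")
print("n choose 2:", comb(len(LANGUAGES), 2))
```

Each pair appears once, so a pair such as (en, nl) stands for both translation directions.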
Please cite the following article if you use any part of the corpus
in your own work:
Jörg Tiedemann, Lars Nygaard, 2004,
The OPUS corpus - parallel & free.
In Proceedings of the Fourth International Conference on Language
Resources and Evaluation (LREC'04). Lisbon, Portugal
Download
Complete download: Europarl3_0.2b.tar (3.9G)
old version: Europarl2
NEW: Dutch Europarl3 Treebank (parsed with Alpino)
Upper-right triangle: txt = plain text sentence alignment files, language IDs = XML file samples
Bottom-left triangle: XML files (ces = sentence alignment files in XCES format, language IDs = gzipped tar-archives of corpus files in XML)
Statistics
Number of files, tokens, and sentences per language
Number of sentence alignment units per language pair
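Sentence alignment units per language pair can be recounted from the XCES alignment files. The sketch below is a hedged illustration, not the tooling used to produce the statistics. It assumes the layout commonly seen in XCES alignment files: `linkGrp` elements carrying `fromDoc` and `toDoc` attributes, each containing `link` elements whose `xtargets` attribute lists source and target sentence IDs separated by `;`. The sample document and file names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample in the assumed XCES layout; not taken from the corpus.
SAMPLE = """<cesAlign version="1.0">
  <linkGrp targType="s" fromDoc="en/a.xml" toDoc="nl/a.xml">
    <link xtargets="1;1" />
    <link xtargets="2 3;2" />
    <link xtargets=";3" />
  </linkGrp>
</cesAlign>"""


def count_alignment_units(xces_text):
    """Count link elements per (fromDoc, toDoc) document pair."""
    root = ET.fromstring(xces_text)
    counts = {}
    for group in root.iter("linkGrp"):
        pair = (group.get("fromDoc"), group.get("toDoc"))
        counts[pair] = counts.get(pair, 0) + len(group.findall("link"))
    return counts


def link_targets(link):
    """Split an xtargets value into source and target sentence ID lists."""
    source_ids, target_ids = link.get("xtargets").split(";")
    return source_ids.split(), target_ids.split()


counts = count_alignment_units(SAMPLE)
first_group = ET.fromstring(SAMPLE).find("linkGrp")
targets = [link_targets(link) for link in first_group.findall("link")]
print(counts)
print(targets)
```

Whether links with an empty side (here `;3`) count as alignment units in the published statistics is not stated on the page; this sketch counts every `link` element.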