OPUS - an open source parallel corpus
OPUS is an attempt to collect translated texts from the web,
to convert and align the entire collection, to add linguistic
annotation, and to provide the community with a publicly available
parallel corpus. OPUS is based on open source products and is
also delivered as an open source package. The current corpus was
compiled automatically with several tools; no manual corrections
have been made.
The OPUS collection is growing! Check this page from time to time to
see new data arriving ...
Contributions are very welcome! Please contact
j.tiedemann@rug.nl
News
- A new version of the EMEA corpus is available! (v0.3):
22 languages, 320 million tokens; also available in MOSES/GIZA++ and TMX formats!
Search & Browse
Tools
Downloads & Samples:
Publications
- Jörg Tiedemann, to appear
- News from OPUS - A Collection of Multilingual Parallel Corpora with Tools and Interfaces [pdf]
To appear in N. Nicolov and K. Bontcheva and G. Angelova and R. Mitkov (eds.)
Recent Advances in Natural Language Processing (vol V), John Benjamins, Amsterdam/Philadelphia
- Jörg Tiedemann, Lars Nygaard, 2004
- The OPUS corpus - parallel & free. [pdf]
In Proceedings of the Fourth International Conference on Language
Resources and Evaluation (LREC'04), Lisbon, Portugal, May 26-28, 2004.
- Jörg Tiedemann, 2003
- OPUS - an open source parallel corpus. [pdf]
In Proceedings of the 13th Nordic Conference on Computational
Linguistics, University of Iceland, Reykjavik, 2003.
- Jörg Tiedemann, 2007
- Building a Multilingual Parallel Subtitle Corpus. [pdf]
In Proceedings of CLIN 17, Leuven, Belgium, 2007.
- Jörg Tiedemann, 2007
- Improved Sentence Alignment for Movie Subtitles. [pdf]
In Proceedings of RANLP '07, Borovets, Bulgaria, 2007.