OPUS - an open source parallel corpus
OPUS is an attempt to collect translated texts from the web,
to convert and align the entire collection, to add linguistic
annotation, and to provide the community with a publicly available
parallel corpus. OPUS is based on open source products and is
itself delivered as an open source package. The current corpus was
compiled automatically with several tools; no manual corrections
have been made.
The OPUS collection is growing! Check this page from time to time to
see new data arriving ...
Contributions are very welcome! Please contact
j.tiedemann@rug.nl
News
- A new version of the EMEA corpus (v0.3) is available:
22 languages, 320 million tokens; also available in Moses/GIZA++ and TMX formats.
Please look at the publications below for more information about OPUS.
Please cite the last or the first one in the list if you use any part of the corpus
in your own work!
Publications
- Jörg Tiedemann, 2009. News from OPUS - A Collection of Multilingual
  Parallel Corpora with Tools and Interfaces. In N. Nicolov, K. Bontcheva,
  G. Angelova and R. Mitkov (eds.), Recent Advances in Natural Language
  Processing (vol V), pages 237-248, John Benjamins, Amsterdam/Philadelphia.
- Jörg Tiedemann, 2008. Synchronizing Translated Movie Subtitles.
  In Proceedings of the 6th International Conference on Language
  Resources and Evaluation (LREC'2008).
- Jörg Tiedemann, 2007. Building a Multilingual Parallel Subtitle Corpus.
  In Proceedings of CLIN 17, Leuven, Belgium, 2007.
- Jörg Tiedemann, 2007. Improved Sentence Alignment for Movie Subtitles.
  In Proceedings of RANLP '07, Borovets, Bulgaria, 2007.
- Jörg Tiedemann, to appear. OPUS - an open source parallel corpus.
  In Proceedings of the 13th Nordic Conference on Computational
  Linguistics, University of Iceland, Reykjavik, 2003.
- Jörg Tiedemann, Lars Nygaard, 2004. The OPUS corpus - parallel & free.
  In Proceedings of the Fourth International Conference on Language
  Resources and Evaluation (LREC'04). Lisbon, Portugal, May 26-28.