http://www.euromatrixplus.net/multi-un/
The MultiUN parallel corpus is extracted from the United Nations website, then cleaned and converted to XML at the Language Technology Lab of DFKI GmbH (LT-DFKI), Germany. The documents were published by the UN from 2000 to 2009.
For a detailed description of this corpus, please read: MultiUN: A Multilingual corpus from United Nation Documents, Andreas Eisele and Yu Chen, LREC 2010. Please cite the paper if you use this corpus in your work.
Release v2
An updated version v2 is coming soon.
Release v1
On March 20, 2010, we released the first version of the corpus. It is a source release containing the document files and a script for text extraction.
Download
Download the files listed below for any set of languages and decompress them in the same directory. There is no need to download all of them if only a subset of the languages is needed.
English portion of MultiUN, 805 MB, 01/2000-09/2010
French portion of MultiUN, 803 MB, 01/2000-09/2010
Spanish portion of MultiUN, 675 MB, 01/2000-09/2010
Arabic portion of MultiUN, 656 MB, 01/2000-09/2010
Russian portion of MultiUN, 876 MB, 01/1997-10/2009
Chinese portion of MultiUN, 524 MB, 01/2000-09/2010
German portion of MultiUN, 14 MB, 01/2000-09/2010
An extraction script, extract.py, for the MultiUN corpus can be downloaded here.
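As a rough illustration of the "decompress them in the same directory" step, here is a minimal Python sketch. It assumes the downloaded files are ordinary archives with one of the listed suffixes; the actual archive format and file names are not stated on this page, so the suffix list, the function name `unpack_all`, and the directory names are all hypothetical. It does not call extract.py, whose interface is not described here.

```python
import shutil
from pathlib import Path

# Assumption: the language portions are standard archives with one of these
# suffixes. Adjust the tuple to match the files you actually downloaded.
ARCHIVE_SUFFIXES = (".tar.gz", ".tgz", ".tar.bz2", ".zip")


def unpack_all(download_dir, target_dir):
    """Unpack every archive found in download_dir into one shared target_dir.

    Returns the names of the archives that were unpacked, in sorted order.
    """
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    unpacked = []
    for archive in sorted(Path(download_dir).iterdir()):
        # Skip anything that is not a recognised archive file.
        if archive.is_file() and archive.name.endswith(ARCHIVE_SUFFIXES):
            # shutil infers the archive format from the file extension.
            shutil.unpack_archive(str(archive), str(target))
            unpacked.append(archive.name)
    return unpacked
```

For example, calling `unpack_all("downloads", "multiun")` would unpack whichever language portions are present in a hypothetical `downloads` folder into a single `multiun` directory, so only the languages you need have to be downloaded.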
Special release for IWSLT 2011
Upon request by the IWSLT 2011 organizer, a special release of the sentence-aligned versions of the Ar-En and Zh-En MultiUN data was made available in August 2011 to support the evaluation for IWSLT 2011.
Download
Acknowledgments
This work was supported by the EuroMatrixPlus project funded by the European Commission (7th Framework Programme).