The release of the ECCE (English Chinese Corpus of Editorials) Corpus 2.0 (英汉社论平行语料库)



Published: 2019-09-24

ECCE 2.0 (ECCE英汉社论平行语料库2.0) can be downloaded at https://pan.baidu.com/s/1boDXxwV

About the ECCE Corpus 2.0

The ECCE Corpus 2.0 is released as an augmented version of an earlier release of the Bilingual Financial Times Editorial Corpus (ECCE) 1.0 on 10 June 2017. This update adds 654 texts (327 English originals aligned with their Chinese equivalents).

The ECCE Corpus 2.0 is composed of 431,977 English words and 760,836 Chinese characters. The publication dates of the 654 added texts range from 31 March 2014 to 22 June 2017.

Please cite:

Yang, Linwei. 2017. ECCE 2.0: The English Chinese Corpus of Editorials.（杨林伟，2017，ECCE英汉社论平行语料库2.0。）

Please refer to the documentation of version 1.0 below for more background on ECCE.

About the ECCE Corpus 1.0

The ECCE corpus 1.0 (ECCE is pronounced /'eki/ and is short for the English Chinese Corpus of Editorials) was created by Linwei Yang and his MA students at Yantai University before Linwei joined the PhD programme at the National Research Centre for Foreign Language Education of Beijing Foreign Studies University.

The bilingual texts of ECCE were originally extracted from The Financial Times website, and sentence-aligned by Linwei's team. The earlier online version of the ECCE corpus 1.0 (known as 'Bilingual FT Editorial Corpus') has been mounted at http://www.icorpus.net/application/ft/. The corpus was post-edited before it was uploaded to http://corpus.bfsu.edu.cn/CORPORA.htm by Jiajin Xu.

The publication dates of the texts range from 16 September 2009 to 21 March 2014.

The ECCE 1.0 corpus is composed of 238,363 English words and 424,921 Chinese characters.

(The token definition for English words is '[a-zA-Z0-9-]+', and '[\u4e00-\u9fa5]|[a-zA-Zａ-ｚＡ-Ｚ0-9０-９\.%％]+' for Chinese characters.)
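The two token definitions above can be applied directly as regular expressions. The following is a minimal sketch in Python, not the script used to produce the published counts; the function names and the sample strings are hypothetical and only illustrate how each pattern segments text.

```python
import re

# Patterns copied from the token definitions given in the documentation.
EN_TOKEN = re.compile(r"[a-zA-Z0-9-]+")
ZH_TOKEN = re.compile(r"[\u4e00-\u9fa5]|[a-zA-Zａ-ｚＡ-Ｚ0-9０-９\.%％]+")


def count_en_tokens(text: str) -> int:
    """Count English tokens: runs of ASCII letters, digits and hyphens."""
    return len(EN_TOKEN.findall(text))


def count_zh_tokens(text: str) -> int:
    """Count Chinese tokens: each Han character counts as one token, and
    each run of letters, digits, dots or percent signs counts as one."""
    return len(ZH_TOKEN.findall(text))


if __name__ == "__main__":
    # Hypothetical sample sentences, not taken from the corpus.
    print(count_en_tokens("The well-known FT editorial, 2014."))
    print(count_zh_tokens("中国经济增长7.5%"))
```

Under these definitions a hyphenated form such as "well-known" is a single English token, and a figure such as "7.5%" embedded in Chinese text is a single token alongside the individual Han characters.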

The texts are provided both as plain text, encoded in UTF-8 and in ANSI (GB2312, code page 936), and in SQL database format.
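Because the plain-text files come in two encodings, a reader script needs to pick the right codec. The sketch below is one possible approach in Python, assuming the files are read as raw bytes: it tries UTF-8 first and falls back to code page 936 (Python's `cp936` codec, an alias of `gbk`, which covers GB2312). The function name is hypothetical and not part of the corpus distribution.

```python
def decode_corpus_bytes(raw: bytes) -> str:
    """Decode corpus text that may be UTF-8 or ANSI (code page 936).

    UTF-8 is tried first (utf-8-sig also strips a leading BOM if present);
    if the bytes are not valid UTF-8, code page 936 is used instead.
    """
    try:
        return raw.decode("utf-8-sig")
    except UnicodeDecodeError:
        return raw.decode("cp936")


if __name__ == "__main__":
    # Hypothetical sample text, encoded both ways for illustration.
    sample = "英汉社论平行语料库"
    print(decode_corpus_bytes(sample.encode("utf-8")))
    print(decode_corpus_bytes(sample.encode("cp936")))
```

Trying UTF-8 first is a deliberate choice: Chinese text encoded in code page 936 is very rarely valid UTF-8, so a failed UTF-8 decode is a reasonable signal to fall back.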

The ECCE_1.0_EN_ZH_ANSI version of the ECCE corpus 1.0 can be searched with SDAU-ParaConc.

Please cite:

Yang, Linwei. 2016. ECCE 1.0: The Bilingual FT Editorial Corpus.（杨林伟，2016，ECCE英汉社论平行语料库1.0。）