Ten-thousand English Compositions of Chinese Learners (the TECCL Corpus)
Download link 1: Download the TECCL corpus here.
Download link 2: Fast download for those from outside China.
A Stanford Parser version of the TECCL treebank is available for download here. (The accuracy of the parsed version has not been checked. Some researchers warn against using parsers to analyse interlanguage, especially English texts by underachievers; others have found that novice writers tend to use simple syntax, and that parsers therefore work well with learner English texts.)
Key information of the TECCL Corpus
Corpus name: Ten-thousand English Compositions of Chinese Learners (the TECCL corpus) (Version 1.1)
Text contributors: Xue, Xizhe (Romanised pinyin notation of the Chinese word "learner")
Project initiator: Jiajin Xu (the National Research Centre for Foreign Language Education, Beijing Foreign Studies University)
Year of corpus creation: 2015
Formats of the corpus: The TECCL corpus is available in two forms, i.e. raw texts and part-of-speech tagged texts. They are stored in two folders, i.e. 01TECCL_RAW and 02TECCL_POS. The POS texts were annotated with tagset version 7 (C7; cf. http://ucrel.lancs.ac.uk/claws7tags.html) of the CLAWS POS tagger developed at UCREL, Lancaster University, UK.
Citation: Xue, Xizhe. 2015. Ten-thousand English Compositions of Chinese Learners (The TECCL corpus), Version 1.1. The National Research Centre for Foreign Language Education, Beijing Foreign Studies University.
The TECCL corpus: Its background and highlights
The TECCL corpus contains approximately 10,000 writing samples of Chinese EFL learners, totalling 1,817,472 words (Note: We consider as words all alphanumeric strings, including hyphenated strings, represented by the regular expression [a-zA-Z0-9-]+.). Initially, 10,127 texts were sampled from an online writing and scoring system. A total of 262 texts (blank texts, texts written in Chinese, translated English texts, and duplicated and/or plagiarised texts) were removed by hand. As a result, the final version of the TECCL corpus consists of 9,865 texts. When submitting their texts to the online system, all contributors agreed to share them for future academic use. The texts were further anonymised to minimise the possibility of disclosing the writers' identities. The sampling frame of the corpus was drawn up by Jiajin Xu, who also undertook all the text cleaning and POS tagging. Liangping Wu assisted with the text cleaning at the early stage of the project.
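The word definition given above can be reproduced with a short script. The following is a minimal sketch only: the regular expression is the one quoted in the note, but the function names, the `*.txt` file pattern and the UTF-8 encoding are assumptions made for illustration and are not documented properties of the corpus files.

```python
import re
from pathlib import Path

# The word definition quoted in the corpus description:
# any run of ASCII letters, digits and hyphens counts as one word.
TOKEN_RE = re.compile(r"[a-zA-Z0-9-]+")


def count_words(text: str) -> int:
    """Count the word tokens in a string under the corpus definition."""
    return len(TOKEN_RE.findall(text))


def count_corpus_words(folder: str) -> int:
    """Sum the word counts of all text files in a folder.

    Assumption: the raw texts are stored as *.txt files that can be
    read as UTF-8; adjust the pattern and encoding to the actual files.
    """
    total = 0
    for path in sorted(Path(folder).glob("*.txt")):
        text = path.read_text(encoding="utf-8", errors="ignore")
        total += count_words(text)
    return total
```

Calling `count_corpus_words("01TECCL_RAW")` would then give a total for the raw-text folder. Note that under this definition an apostrophe splits a word (isn't counts as two tokens) and a free-standing hyphen counts as a token, so totals may differ slightly from those of other tokenisers.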
The TECCL corpus stands out not for its size but for its representativeness in the following five respects.
1) Compared with other available Chinese learner corpora, the TECCL corpus is more up to date as of 2015. The material included was produced between 2011 and 2015. The corpus was compiled to mirror the English of Chinese EFL learners at that time.
2) The corpus features a wide range of topics or prompts. A rough estimate puts the number at over 1,000 different essay topics.
3) The writers in the corpus run the gamut from elementary school pupils to postgraduate students, with undergraduates being the overwhelming majority. The ratio of so-called 985/211 to non-985/211 universities largely corresponds to the actual proportion among Chinese universities.
4) The geographical spread of the writers in the TECCL corpus is by far the widest of all corpora of Chinese EFL learners' English. The corpus encompasses text material from 32 provinces and (autonomous) regions, including Hong Kong and Taiwan.
5) In stark contrast to other corpora of Chinese EFL learners' English, the TECCL corpus comprises both texts written under (time) pressure, in class and in testing contexts, and texts written after class. The corpus even takes in some collaborative writing samples. Most previous corpora of Chinese EFL learners' English consist of compositions produced in high-stakes standardised English tests, such as CET-4/6, TEM4/8 and PETS.
A known problem with text typography
Chinese learners have a notorious habit of typing words immediately after commas and full stops without a space. This spacing problem has not been corrected in the final version of the corpus. Fortunately, it does not affect the computation of word tokens or the tagging of parts of speech. Users of the corpus can add a white space after the punctuation marks, if necessary.
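For users who wish to add the missing spaces themselves, one possible approach is sketched below. This is an illustration only and not part of the corpus tooling: the function name and the two patterns are assumptions. The rules are deliberately conservative. A space is added after a comma only when a letter follows, and after a full stop only when it stands between a lowercase letter and an uppercase letter, so that numbers such as 1,000 and 3.5 and abbreviations such as e.g. are left alone.

```python
import re

# A comma directly followed by a letter, e.g. "tea,coffee".
COMMA_RE = re.compile(r",(?=[A-Za-z])")

# A full stop between a lowercase and an uppercase letter, e.g. "milk.They".
# This deliberately skips "3.5", "e.g." and "U.S.A".
STOP_RE = re.compile(r"(?<=[a-z])\.(?=[A-Z])")


def add_missing_spaces(text: str) -> str:
    """Insert a space after commas and full stops that lack one."""
    text = COMMA_RE.sub(", ", text)
    return STOP_RE.sub(". ", text)
```

The trade-off is that a sentence beginning with a lowercase letter right after a full stop (for example "home.then") is not repaired. Loosening the second pattern would catch such cases, at the cost of also splitting abbreviations and web addresses.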
The TECCL corpus can be downloaded for personal research, but must not be used for any commercial purpose.
Please feel free to report any problems with the texts to bfsucrg (AT) sina DOT com.
More information about the corpus is available at the official site of the Corpus Research Group, National Research Centre for Foreign Language Education, Beijing Foreign Studies University, http://corpus.bfsu.edu.cn.
Web-based concordancing of the TECCL corpus is enabled at BFSU CQPweb, http://220.127.116.11/cqp/.
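Citation format (for Chinese-language publications): 薛熙哲 (Xue, Xizhe). 2015. 中国学生万篇英语作文语料库 (V1.1) (Ten-thousand English Compositions of Chinese Learners, Version 1.1; the TECCL corpus for short).
Every effort has been made to clean up unsuitable material in the corpus. If you find any remaining problems, please contact: bfsucrg (AT) sina DOT com.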