Download a copy of the CQPweb tutorial (in Chinese) here.
The corpora developed by FLERIC members can be found on this page, and many of them can also be accessed at the BFSU CQPweb corpus portal with user ID 'test' and password 'test'.
iWriteBaby corpus: an 8-million-word corpus of Chinese learners' written English.
The CLIPS corpus (Chinese Learners’ Integrated Pear Stories corpus, “中国英语学习者梨子故事综合语料库”, or “梨子故事语料库” for short): a corpus of spoken and written English/Chinese narrative discourse produced by EFL college students in response to the video prompt, the film 'The Pear Stories'. The corpus was designed and developed by Jiajin Xu. The CLIPS corpus is composed of 500,996 English words (word definition regex: [a-z0-9]+) and 315,136 Chinese characters (Chinese character definition regex: [\u4e00-\u9fa5]|[a-z0-9]+), totalling 816,132 English words and Chinese characters.
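The two counting definitions quoted for CLIPS can be illustrated with a minimal Python sketch. The regular expressions are taken from the description above; the function names, the sample sentences, and the case-insensitive matching are assumptions made for illustration, and this is not necessarily how the published counts were produced.

```python
import re

# Patterns quoted from the CLIPS description.
# Case-insensitive matching is an assumption; the source does not say
# whether texts were lowercased before counting.
ENGLISH_WORD = re.compile(r"[a-z0-9]+", re.IGNORECASE)
CHINESE_UNIT = re.compile(r"[\u4e00-\u9fa5]|[a-z0-9]+", re.IGNORECASE)


def count_english_words(text: str) -> int:
    """Count runs of Latin letters and digits as English words."""
    return len(ENGLISH_WORD.findall(text))


def count_chinese_units(text: str) -> int:
    """Count each Chinese character, plus each run of Latin letters/digits, as one unit."""
    return len(CHINESE_UNIT.findall(text))


if __name__ == "__main__":
    # Hypothetical sample sentences, not taken from the corpus.
    print(count_english_words("The boy picked 3 pears."))  # 5
    print(count_chinese_units("男孩摘了3个梨子。"))  # 8
```

Under these definitions, punctuation (including Chinese full stops) is never counted, and a digit or Latin-letter string embedded in Chinese text counts as a single unit.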
The TECCL corpus V1.1 (Ten-thousand English Compositions of Chinese Learners, Version 1.1, 中国学生万篇英语作文语料库) is a corpus of 1,817,335 words written by Chinese EFL learners at different levels of schooling and from almost all over China, covering a great variety of writing prompts. The essays were produced between 2010 and 2015, some in class and others at home. The texts of the TECCL corpus can be downloaded from here, and concordanced online at http://220.127.116.11/cqp/. A Stanford Parser version of the TECCL treebank is available for download here. (The accuracy of the parsed version has not been checked. Some people warn against using parsers to analyse interlanguage, especially underachievers' English texts; others have found that novice writers tend to use simple syntax, and that parsers therefore work well with learner English texts.)
The Multilingual Student Translation (MUST) Chinese partner corpus. (Coordinator: Jiajin Xu; members: Yang Liu and Hui Kang). This project is under the general directorship of Sylviane Granger and Marie-Aude Lefer. The Chinese MUST component consists of 100 Chinese to English and 100 English to Chinese texts. The translation text collection was mainly coordinated by Yang Liu. Hui Kang transcribed all the hand-written translation task sheets and uploaded the 200 text samples to the project website. The text segmentation and bilingual alignment were also done by Hui Kang.
The DEAP (Database of English for Academic Purposes) Corpus (under construction) aims to collect texts of 100 million words covering 20 or more disciplines. BioDEAP (Biology sub-corpus), EcnDEAP (Economics sub-corpus), LinDEAP (Linguistics sub-corpus), MatDEAP (Materials Science sub-corpus), MedDEAP (Medicine sub-corpus), MilDEAP (Military Science sub-corpus) and PsyDEAP (Psychology sub-corpus) have been finalised. Each sub-corpus is composed of 5 million words.
CLOB corpus: A Brown family British English corpus of one million words published largely in 2009, developed under the leadership of Jiajin Xu and Maocheng Liang. An article describing the corpus was published in the 2013 issue of ICAME Journal. Download CLOB (18.2MB). Publications based on the Crown and CLOB corpora can be found here. A detailed description of the CLOB corpus is available in the CoRD corpus resource database of Helsinki University.
Crown corpus: A Brown family American English corpus of one million words published largely in 2009, developed under the leadership of Jiajin Xu and Maocheng Liang. An article describing the corpus was published in the 2013 issue of ICAME Journal. Download Crown (18.2MB). Publications based on the Crown and CLOB corpora can be found here. A detailed description of the Crown corpus is available in the CoRD corpus resource database of Helsinki University.
MedAca (Medical English discourse of Academia) corpus -- Clinical medicine component (MedAca医学学术英语语料库-临床医学子库) contains five million words of medical English research article texts from 18 different subject areas of clinical medicine. The building of the corpus was proposed by Jiajin Xu and the text gathering was undertaken by a group of English teachers (namely, Feng Xin, Qi Hui, Wu Jingjing, Ye He, Wan Ling, and You Sheng) at the School of Foreign Languages, Fujian Medical University. The first release, Version 1 of the MedAca corpus, was compiled in 2015. You can download the MedAca 1.0 word list here. Version 2 of the corpus was finalised on 8 August 2017. Version 1 data was incorporated into the new Version 2 MedAca corpus (which consists of 5,041,631 tokens and 99,765 types in 1,186 files). The MedAca V2.0 word list can be downloaded at the link. The one-million-word MedAca corpus V1.0 is now searchable online at http://18.104.22.168/cqp/. The corpus has since been renamed MedDEAP under the parent project DEAP.
TIME Magazine Corpus (1923-2008) of about 196 million words, roughly twice the size of the BYU Time magazine corpus (by Mark Davies). The text collection was obtained a few years ago and mounted on BFSU CQPweb in late 2015.
PATTIE (Preschoolers- and Teenagers-oriented Texts in English) corpus compiled by Dr. Jie Ji. The construction of the corpus was completed in late 2014. PATTIE will be available soon via BFSU CQPweb. Download the PATTIE wordlist created by PowerConc, and the PATTIE wordlist created by AntConc. The corpus can be downloaded here for personal research only.
ToRCH2009 Corpus (ToRCH2009现代汉语平衡语料库): Texts of Recent CHinese corpus 2009 (A Brown family Chinese corpus of one million words) developed under the leadership of Jiajin Xu. Download ToRCH2009 here.
The BFSU DiSCUSS corpus (DiSCUSS北外平衡汉语口语语料库) (under construction): The corpus of Diversified Spoken Chinese Utterances in Social Settings (DiSCUSS) aims to provide a balanced corpus of spoken Chinese for public access. The corpus was designed by Jiajin Xu and compiled by Jiajin Xu and Zhe Chen.
ICC-Chinese: International Comparable Corpus - Chinese Component.
English-Chinese parallel corpora
TED English Chinese parallel corpus of speeches: 6,187,849 English words and Chinese characters, collated by Jiajin Xu based on the Web Inventory of Transcribed and Translated Talks. The corpus can be downloaded here.
Conference Interpreting Corpus under construction (as of 29 March 2017).
The ECCE Corpus 1.0 (ECCE英汉社论平行语料库1.0): ECCE is pronounced as ['eki]. The ECCE (English Chinese Corpus of Editorials) corpus 1.0 was created by Linwei Yang and his MA students at Yantai University before Linwei joined the Ph.D. programme at the National Research Centre for Foreign Language Education of Beijing Foreign Studies University. The bilingual texts of ECCE were originally extracted from The Financial Times website, and sentence-aligned by Linwei's team. The corpus size of ECCE 1.0 is 238,363 English words and 424,921 Chinese characters.
The ECCE Corpus 2.0 (ECCE英汉社论平行语料库2.0): The corpus was compiled by Linwei Yang and his students at Yantai University.
燚炎英汉平行语料库 (Yiyan English-Chinese Parallel Corpus, under construction). The corpus was designed by Jiajin Xu and compiled by Jiajin Xu and Xiuling Xu.
Corpora or text collections prepared by people outside the FLERIC team.
Download the BNC XML edition here.