Corpora - BFSU Corpus Linguistics


Corpora

Published: 2019-09-17

CQPweb tutorial (in Chinese): CQPweb concise illustrated user manual; CQP syntax advanced search guide

The corpora developed by FLERIC members are listed on this page. Many of them can also be accessed at the BFSU CQPweb corpus portal with user ID 'test' and password 'test'.

Learner corpora

-iWriteBaby corpus: an 8-million-word corpus of Chinese learners' written English.

-The CLIPS corpus (Chinese Learners’ Integrated Pear Stories corpus; Chinese name “中国英语学习者梨子故事综合语料库”, or “梨子故事语料库” for short): a corpus of spoken and written English and Chinese narrative discourse produced by EFL college students in response to the film 'The Pear Stories', used as a video prompt. The corpus was designed and developed by Jiajin Xu. It comprises 500,996 English words (word definition regex: [a-z0-9]+) and 315,136 Chinese characters (Chinese character definition regex: [\u4e00-\u9fa5]|[a-z0-9]+), totalling 816,132 English words and Chinese characters.

-The TECCL corpus V1.1 (Ten-thousand English Compositions of Chinese Learners, Version 1.1, 中国学生万篇英语作文语料库) is a corpus of 1,817,335 words written by Chinese EFL learners at different levels of schooling and from almost all over China, covering a great variety of writing prompts. The essays were produced between 2010 and 2015, some in class and others at home. The texts of the TECCL corpus can be downloaded from here, and concordanced online at http://114.251.154.212/cqp/. A Stanford Parser version of the TECCL treebank is available for download here. (The accuracy of the parsed version has not been checked. Some researchers caution against using parsers to analyse interlanguage, especially English texts by underachievers; others have found that novice writers tend to use simple syntax, so parsers work well on learner English texts.)

-The Multilingual Student Translation (MUST) Chinese partner corpus. (Coordinator: Jiajin Xu; members: Yang Liu and Hui Kang). This project is under the general directorship of Sylviane Granger and Marie-Aude Lefer. The Chinese MUST component consists of 100 Chinese-to-English and 100 English-to-Chinese texts. The collection of translation texts was mainly coordinated by Yang Liu. Hui Kang transcribed all the hand-written translation task sheets and uploaded the 200 text samples to the project website. The text segmentation and bilingual alignment were all done by Hui Kang.
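
The two regex definitions quoted for the CLIPS corpus can be turned into a small counting script. The following is a minimal sketch in Python, not the compilers' actual tool: the function names are hypothetical, and lowercasing the text before matching is an assumption (the published patterns list only a-z).

```python
import re

# Patterns quoted from the CLIPS corpus description.
ENGLISH_WORD = re.compile(r"[a-z0-9]+")
CHINESE_UNIT = re.compile(r"[\u4e00-\u9fa5]|[a-z0-9]+")


def count_english_words(text: str) -> int:
    """Count runs of ASCII letters/digits as English words.

    Assumption: the text is lowercased first, because the published
    pattern contains no uppercase range.
    """
    return len(ENGLISH_WORD.findall(text.lower()))


def count_chinese_units(text: str) -> int:
    """Count each Chinese character, or each run of ASCII letters/digits,
    as one unit, following the second published pattern."""
    return len(CHINESE_UNIT.findall(text.lower()))


if __name__ == "__main__":
    print(count_english_words("The boy picked 3 pears."))
    print(count_chinese_units("男孩摘了3个梨子"))
    # The reported totals add up: 500,996 + 315,136 = 816,132.
    print(500_996 + 315_136)
```

Under these definitions, punctuation and apostrophes act as separators, so a form such as "don't" counts as two English words, and a digit or Latin string embedded in Chinese text counts as a single unit.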

English corpora

-The aiTECCL Corpus: A corpus of two million words generated by the GPT-3.5 model from the same writing prompts as those used in the TECCL Corpus. It aims to serve as a reference corpus that exhibits a native-like linguistic quality.
-CROWN2021: A Brown family American English corpus of one million words. Download CROWN2021. Download parsed version of CROWN2021.
-CLOB2021 (under construction): A Brown family British English corpus of one million words.
-China English Corpus (under construction): The corpus aims to gather one million words of edited present-day English published in China. As of 22 May 2021, the collection of news texts for the China English Corpus has been completed. Ms. Zhang Chentingyan has contributed 334 texts to the collection.
-The DEAP (Database of English for Academic Purposes) Corpus (under construction) aims to collect texts of over 100 million words covering 20 or more disciplines. BioDEAP (Biology sub-corpus), EcnDEAP (Economics sub-corpus), LinDEAP (Linguistics sub-corpus), MatDEAP (Materials Science sub-corpus), MedDEAP (Medicine sub-corpus), MilDEAP (Military Science sub-corpus) and PsyDEAP (Psychology sub-corpus) have been finalised. Each sub-corpus is composed of 5 million words.
-The DEAP Baby (V1.0) Corpus is a balanced multi-discipline English for Academic Purposes corpus based on the resampling of the 125-million-word DEAP (Database of English for Academic Purposes) Corpus.
-CLOB corpus: A Brown family British English corpus of one million words (published largely in 2009), developed under the leadership of Jiajin Xu and Maocheng Liang. An article describing the corpus was published in the 2013 issue of ICAME Journal. Download CLOB (18.2MB). A detailed description of the CLOB corpus is available in the CoRD corpus resource database of Helsinki University.
-Crown corpus: A Brown family American English corpus of one million words (published largely in 2009), developed under the leadership of Jiajin Xu and Maocheng Liang. An article describing the corpus was published in the 2013 issue of ICAME Journal. Download Crown (18.2MB). A detailed description of the Crown corpus is available in the CoRD corpus resource database of Helsinki University.
-The MedAca (Medical English discourse of Academia) corpus, Clinical medicine component (MedAca医学学术英语语料库-临床医学子库), contains five million words of medical English research articles from 18 subject areas of clinical medicine. The building of the corpus was proposed by Jiajin Xu, and the texts were gathered by a group of English teachers (namely, Feng Xin, Qi Hui, Wu Jingjing, Ye He, Wan Ling, and You Sheng) at the School of Foreign Languages, Fujian Medical University. The first release, Version 1 of the MedAca corpus, was compiled in 2015. You can download the MedAca 1.0 word list here. The one-million-word MedAca corpus V1.0 is searchable online at http://114.251.154.212/cqp/. The corpus has since been renamed MedDEAP under the parent project DEAP.
-The MedDEAP Corpus: Version 2 of the MedAca Corpus (finalised on 8 August 2017) was renamed MedDEAP and has officially become a component of the DEAP corpus family. The Version 1 data were incorporated into the new Version 2 MedAca/MedDEAP corpus, which consists of 5,041,631 tokens and 99,765 types in 1,186 files. MedDEAP is available online at http://114.251.154.212/cqp/.
-TIME Magazine Corpus (1923-2008): about 196 million words, twice the size of the BYU Time magazine corpus (by Mark Davies). The text collection was obtained a few years ago and mounted on BFSU CQPweb in late 2015.
-The Independent Corpus gathers texts from The Independent (a British national morning newspaper) between 2009 and 2015. The corpus size is about 231 million words. The corpus was built by Liangping Wu.
-NESSIE Corpus 1st release (NESSIEv1, Native English Speakers Similarly or Identically-prompted Essays). Download the corpus here.
-PATTIE (Preschoolers- and Teenagers-oriented Texts in English) corpus compiled by Dr. Jie Ji. The construction of the corpus was completed in late 2014. PATTIE will be available soon via BFSU CQPweb. The corpus can be downloaded here for personal research only.

Chinese corpora

-ToRCH2009 Corpus (ToRCH2009现代汉语平衡语料库): Texts of Recent CHinese corpus 2009 (A Brown family Chinese corpus of one million words) developed under the leadership of Jiajin Xu. Download ToRCH2009 here.

-ToRCH2014 Corpus (ToRCH2014现代汉语平衡语料库): Texts of Recent CHinese corpus 2014 (A Brown family Chinese corpus of one million words) developed under the leadership of Jiajin Xu. Download ToRCH2014 here.

-ToRCH2019 Corpus (ToRCH2019现代汉语平衡语料库): Texts of Recent CHinese corpus 2019 (A Brown family Chinese corpus of one million words) developed under the leadership of Jiajin Xu. Download ToRCH2019 here.

-The BFSU DiSCUSS (Diversified Spoken Chinese Uttered in Social Settings) Corpus is a balanced corpus of spoken Chinese for public access. The corpus was designed by Jiajin Xu and the spoken data were collected by Jiajin Xu, Tong Dong, Mingchen Sun, Zhe Chen, Fangfang Liu, Bo Wang, Yan Wang, Lihong Quan, Zhouye Zhu and Jun Lu. A slim version of the BFSU DiSCUSS Corpus will be customised to serve as the spoken component of the ICC-Chinese corpus (ICC-CN) (https://korpus.cz/icc). Download the DiSCUSS corpus.

-ICC-Chinese (ICC-CN): International Comparable Corpus - Chinese Component. Completed in April 2022.

English-Chinese parallel corpora

-Yiyan English-Chinese Parallel Corpus (燚炎英汉平行语料库). The corpus was designed by Jiajin Xu and compiled by Xiuling Xu and Jiajin Xu. The corpus can be downloaded from here.

-TED English-Chinese parallel corpus of speeches: 6,187,849 English words and Chinese characters, collated by Jiajin Xu based on the Web Inventory of Transcribed and Translated Talks. The corpus can be downloaded here.

-An updated version of TED English-Chinese parallel corpus created by Linwei Yang.

-Conference Interpreting Corpus (under construction as of 29 March 2017).

-The ECCE Corpus 1.0 (ECCE英汉社论平行语料库1.0): ECCE is pronounced as ['eki]. The ECCE (English Chinese Corpus of Editorials) corpus 1.0 was created by Linwei Yang and his MA students at Yantai University before Linwei joined the Ph.D. programme at the National Research Centre for Foreign Language Education of Beijing Foreign Studies University. The bilingual texts of ECCE were originally extracted from The Financial Times website, and sentence-aligned by Linwei's team. The corpus size of ECCE 1.0 is 238,363 English words and 424,921 Chinese characters.

-The ECCE Corpus 2.0 (ECCE英汉社论平行语料库2.0): The corpus was compiled by Linwei Yang and his students at Yantai University.

Corpora or text collections prepared by colleagues outside the FLERIC team

-85 translations of "Tao Te Ching", "Laozi", "Dao De Jing".

-Download BNC XML edition from here.