We are excited to announce that the aiTECCL Corpus, an AIGC (Artificial Intelligence Generated Content) corpus, is now available to all members of the research community. The aiTECCL Corpus was compiled by Jiajin Xu and Mingchen Sun of the Corpus Research Group of the National Research Centre for Foreign Language Education at Beijing Foreign Studies University.
The corpus contains two million words generated by the GPT-3.5 model from the same writing prompts as those used in the TECCL Corpus. It aims to serve as a reference corpus whose linguistic quality is close to that of L1 English speakers. The corpus was made available online on 9 August 2023.
Username: test
Password: test
Please cite: Xu, Jiajin & Mingchen Sun. 2023. aiTECCL: An AIGC English Essay Corpus. Beijing: National Research Centre for Foreign Language Education, Beijing Foreign Studies University.
Justifying the concept of "AIGC Corpus" (Artificial Intelligence Generated Content Corpus) or Generative Corpus
The creation of the AIGC Corpus helps expand the concept of "corpus". In the classic definition of a corpus, the included materials must be collections of language that occur authentically or naturally in real-life communication. Clearly, generative texts do not fall under this category. We believe that the rationale for the generative corpus can be viewed from at least three aspects:
1. The so-called principle of "authenticity" is itself a matter of degree. For example, it is questionable whether essays written by learners under exam conditions count as genuine communication. In existing research, some elicited data also has authenticity issues similar to those found in learners' interlanguage. Existing corpora therefore already contain texts with varying degrees of authenticity.
2. The generative corpus can serve as an essential complement to existing corpora. The emergence of the generative corpus can reconcile the distinction between "probable language" and "possible language." Linguistic instances that have not yet appeared in reality can be generated using large language models.
3. Creating a corpus using artificial intelligence technology is a second-best solution under the current conditions for building specific types of corpora. For example, the aiTECCL corpus simulates a reference corpus of approximately 10,000 essays that are close to L1 English speaker language quality and written on the same topics as the essays of Chinese learners. Without artificial intelligence methods for generation, it might be impossible to obtain a reference corpus of such quality and comparability. Similarly, corpora for languages of least-developed countries or countries with extremely small populations would be impossible to establish in the short term without generative technology.
Further details about the prompt and the Python script we utilised to create the corpus will be provided on the site soon.
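On 9 August 2023, the Corpus Research Group of the National Research Centre for Foreign Language Education at Beijing Foreign Studies University completed the aiTECCL Corpus, a generative corpus produced by a large language model.
A note on the "Artificial Intelligence Generated Content Corpus" (AIGC Corpus or Generative Corpus; in Chinese, abbreviated as "生成式语料库")
The emergence of the generative corpus helps expand the intension and extension of the concept of "corpus". In the classic definition of a corpus, the collected materials must be language that is actually used (authentic or naturally occurring) in real-life communication. Generative texts clearly do not fall under this category. We believe that the rationale for the generative corpus can be viewed from at least three aspects.
2. The generative corpus can serve as an important complement to existing corpora. The emergence of the generative corpus can reconcile the distinction between "probable language" and "possible language". Linguistic instances that have not yet appeared in reality can be generated using large language models.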
aiTECCL's counterpart learner corpus: The TECCL Corpus
Please visit: http://corpus.bfsu.edu.cn/info/1070/1449.htm