We are excited to announce that the aiTECCL Corpus, an AIGC corpus, is now available to all members of the research community. The aiTECCL Corpus was compiled by Jiajin Xu and Mingchen Sun of the Corpus Research Group of the National Research Centre for Foreign Language Education at Beijing Foreign Studies University.
The corpus contains two million words generated by the GPT-3.5 model using the same writing prompts as those employed in the TECCL Corpus. It aims to serve as a reference corpus that is close to the linguistic quality of L1 English speakers. The corpus was made available online on 9 August 2023.
URL: http://114.251.154.212/cqp/
Username: test
Password: test
Please cite: Xu, Jiajin & Mingchen Sun. 2023. aiTECCL: An AIGC English Essay Corpus. Beijing: National Research Centre for Foreign Language Education, Beijing Foreign Studies University.
Justifying the concept of "AIGC Corpus" (Artificial Intelligence Generated Content Corpus) or Generative Corpus
The creation of the AIGC Corpus helps expand the concept of "corpus". In the classic definition of a corpus, the included materials must be collections of language that occur authentically or naturally in real-life communication. Clearly, generative texts do not fall under this category. We believe that the rationale for the generative corpus can be viewed from at least three aspects:
1. The so-called principle of "authenticity" is itself a matter of degree. For example, whether essays written by learners under exam conditions belong to genuine communication is questionable. In existing research, some elicited data also has authenticity issues similar to those found in learners' interlanguage. Existing corpora therefore already contain texts with varying degrees of authenticity.
2. The generative corpus can serve as an essential complement to existing corpora. The emergence of the generative corpus can reconcile the distinction between "probable language" and "possible language." Linguistic instances that have not yet appeared in reality can be generated using large language models.
3. Creating a corpus using artificial intelligence technology is a second-best solution under the current conditions for building specific types of corpora. For example, the aiTECCL corpus simulates a reference corpus of approximately 10,000 essays that are close to L1 English speaker language quality and written on the same topics as the essays of Chinese learners. Without the use of artificial intelligence methods for generation, it might be impossible to obtain a reference corpus of such quality and comparability. Similarly, corpora for languages of least-developed countries or countries with extremely small populations would be impossible to establish in the short term without generative technology.
Further details about the prompt and the Python script we utilised to create the corpus will be provided on the site soon.
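The aiTECCL Corpus, a generative corpus built with a large language model, is completed
On 9 August 2023, the Corpus Research Group of the National Research Centre for Foreign Language Education at Beijing Foreign Studies University completed the aiTECCL Corpus, a generative corpus built with a large language model.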
Please cite (Chinese-language reference): 许家金、孙铭辰,2023,aiTECCL生成式作文语料库。北京:中国外语与教育研究中心。
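A note on the "artificial intelligence generative corpus" (AIGC Corpus or Generative Corpus; Chinese short form "生成式语料库")
The emergence of the generative corpus helps extend both the intension and the extension of the concept of "corpus". In the classic definition of a corpus, the collected material must be language that is authentically used (authentic or naturally occurring) in real communication. Generative texts clearly do not belong to this category. We believe the rationale for the generative corpus can be viewed from at least three aspects.
First, the so-called principle of authenticity is itself a matter of degree. For example, whether essays written by learners under exam conditions count as genuine communication is open to doubt. In existing research, some elicited data also has authenticity issues similar to those of learners' interlanguage. Existing corpora therefore already contain texts with varying degrees of authenticity.
Second, the generative corpus can serve as an important complement to existing corpora. Its emergence can reconcile the divide between "probable language" and "possible language". Linguistic instances that have not yet appeared in reality can be generated with large language models.
Third, generating a corpus with artificial intelligence technology is a second-best solution for building specific types of corpora under current conditions. For example, the aiTECCL Corpus simulates a reference corpus of about 10,000 essays that approximate the level of L1 English speakers and are written on the same topics as those of Chinese learners. Without generation by artificial intelligence methods, it would probably be impossible to obtain a reference corpus of such quality and such a high degree of comparability and correspondence. Likewise, corpora for the languages of some extremely underdeveloped countries or countries with very small populations would probably be impossible to build in the short term without generative technology.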
aiTECCL's counterpart learner corpus: The TECCL Corpus
Please visit: http://corpus.bfsu.edu.cn/info/1070/1449.htm