
Farsi corpus available: BFSU Persian corpus completed

Published: 2022-07-23

The faGLOBE Corpus (V1.0)

INTRODUCTION

The faGLOBE Corpus (Version 1) is a balanced collection of contemporary Farsi (Persian) written texts, totaling one million words.

The text samples in the corpus were gathered and cleaned up by Yanjun Li (李彦军) and three students of Persian, namely, Shuainan Chen (陈帅楠), Qi Hu (胡奇) and Tinglu Zhou (周汀鹭), of the School of Asian Studies, Beijing Foreign Studies University (BFSU).

The online version of the faGLOBE Corpus is available at http://114.251.154.212/cqp/. Both the user ID and the password are 'test'.


Key information

Project leader: Yanjun Li (李彦军), School of Asian Studies, BFSU (https://asian.bfsu.edu.cn/info/1046/1626.htm)

Text collectors: Shuainan Chen (陈帅楠), Qi Hu (胡奇) and Tinglu Zhou (周汀鹭), of the School of Asian Studies, BFSU

Time of compilation: January 2022 – July 2022

Size: Approximately one million words

Language: Contemporary Farsi

Number of texts/samples: 500 samples of 2,000+ words each (Texts shorter than 2,000 words are combined to form one 2,000-word sample; the component texts are saved as separate files and marked A, B, C, etc. in the filenames.)

Period: The bulk of the texts were published between 2010 and 2022. The remaining texts were published in the early 2000s.

Released: July 2022
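
The filename convention described under "Number of texts/samples" (component texts of one sample marked A, B, C, etc.) suggests a simple way to regroup the files into samples. The following is a minimal sketch only: the filename pattern (`fa_017A.txt` and similar) and the function name `group_parts` are assumptions made for illustration, since the actual faGLOBE naming scheme is not documented in this announcement.

```python
import re
from collections import defaultdict

# Hypothetical filename pattern: a sample ID ending in a digit, followed by
# an optional part letter (A, B, C, ...), e.g. "fa_017A.txt".
# The real faGLOBE naming scheme may differ.
PART_PATTERN = re.compile(r"^(?P<sample>.+?\d)(?P<part>[A-Z])?\.txt$")


def group_parts(filenames):
    """Map each sample ID to the list of its part files, in part order."""
    groups = defaultdict(list)
    for name in sorted(filenames):
        match = PART_PATTERN.match(name)
        if match is None:
            continue  # skip files that do not follow the assumed pattern
        groups[match.group("sample")].append(name)
    return dict(groups)
```

Under these assumptions, `group_parts(["fa_017B.txt", "fa_017A.txt", "fa_018.txt"])` would place the two `fa_017` parts under one sample ID and `fa_018.txt` under another, so that per-sample word counts can be computed over the combined parts.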


BACKGROUND

On 29 December 2021, Jiajin Xu launched the GLOBE (Global Languages Out of BFSU Expertise) Corpus project, an initiative that aims to collect present-day written texts in all 101 languages taught at BFSU. The corpora follow the sampling frame of the Brown Corpus so that the multilingual GLOBE corpus family is comparable to the Brown family of corpora. The immediate intended application of GLOBE is corpus-based dictionary compilation. The first batch of corpora covers about 30 languages.

The faGLOBE Corpus is a sub-project of the BFSU-funded GLOBE Corpus projects (Ref. 2022SYLZD015 and 2022SYLPY004), whose principal investigator is Prof. Jiajin Xu at the National Research Centre for Foreign Language Education, BFSU. Of the projected corpora of 101 languages, the Farsi corpus is the first to be made publicly available.


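The faGLOBE Corpus (Version 1.0)

The faGLOBE Corpus (Version 1.0) is a balanced corpus of contemporary Persian, with a total size of about one million words.

The text samples in faGLOBE were collected and processed by Yanjun Li (李彦军), a faculty member of the School of Asian Studies, BFSU, together with three students of Persian: Shuainan Chen (陈帅楠), Qi Hu (胡奇) and Tinglu Zhou (周汀鹭).

The corpus can be accessed online through BFSU CQPweb, the BFSU multilingual corpus platform: http://114.251.154.212/cqp/. The user ID and password are both 'test'.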


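Key information

Corpus leader: Yanjun Li (李彦军), School of Asian Studies, BFSU (https://asian.bfsu.edu.cn/info/1046/1626.htm)

Main text collectors: Shuainan Chen (陈帅楠), Qi Hu (胡奇) and Tinglu Zhou (周汀鹭), School of Asian Studies, BFSU

Compilation period: January 2022 to July 2022

Size: About one million words

Language: Contemporary Persian

Number of texts: 500 texts of 2,000 words each (Where several texts shorter than 2,000 words make up one 2,000-word text, their filenames end in A, B, C, etc. to show that they belong together.)

Publication years: The vast majority of the texts were published between 2010 and 2022. A small number were published in the first few years of the new century.

Release date: July 2022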


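Background

On 29 December 2021, BFSU launched the "BFSU Global Corpus Cluster" (北外全球语料库集群) project, also known as the GLOBE Corpus project. The full English name of GLOBE is Corpus of Global Languages Out of BFSU Expertise. The cluster aims to build corpora of contemporary written language for the 101 languages taught at BFSU.

The monolingual balanced corpora in the cluster adopt the sampling scheme of the Brown Corpus so that they are comparable with the existing Brown family corpora, which makes contrastive studies between each language and English or Chinese possible. The primary intended application of the series is corpus-based compilation of multilingual dictionaries. The first batch of GLOBE family corpora covers about 30 languages.

The faGLOBE balanced corpus of Persian is a sub-project of the BFSU Double First-Class project "BFSU Global Corpus Cluster" (project nos. 2022SYLZD015 and 2022SYLPY004), led by Jiajin Xu (许家金) of the National Research Centre for Foreign Language Education, BFSU.

faGLOBE is the first corpus of a less commonly taught language to be completed in the cluster.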