Torch 2009 Corpus

Published: 2019-09-24

This is a brief description of Torch, the 2009 Brown-family Chinese corpus. Torch is an acronym for Texts Of Recent CHinese. The project was initiated under the name CC2009, meaning "Chinese Corpus 2009". The new name, Torch, was proposed by Xu Jiajin and went through several rounds of email discussion among members of the Corpus Research Group at Beijing Foreign Studies University. Most members seemed to agree that it is a memorable and meaningful name, although the decision was not unanimous.

The corpus contains 671 texts covering 15 text types (Press: Reportage, Press: Editorial, Press: Reviews, Religion, Skill and hobbies, Popular lore, Belles-lettres, Miscellaneous: Government & house organs, Learned, Fiction: General, Fiction: Mystery, Fiction: Science, Fiction: Adventure, Fiction: Romance, and Humour).

Most texts in the corpus were published in 2009.

This edition of the Torch corpus has been tokenised (segmented) using ICTCLAS (through the YACSI interface). A manual check of the tokenisation shows an accuracy rate of over 95%.

This edition is called the ‘TORCH 2013 summer edition’. It accepts the ICTCLAS-tokenised texts on an as-is basis; in other words, mis-tokenised words have not been corrected.

Later, all problematic tokenisations will be corrected by human analysts, yielding an updated edition of the Torch corpus. The new edition will be made available through BFSU CQPweb (http://111.200.194.212/cqp/) and Corpus4U (http://www.corpus4u.org/forum/).

You can cite the corpus as:

Xu, Jiajin. 2013. Torch Corpus: Texts of Recent Chinese (2013 summer edition).