Corpus tools developed by BFSUCRG members
GUI tools created using ChatGPT and Python
* Please configure your antivirus software to trust these newly developed tools. If the system shows a pop-up saying the program cannot be run, right-click the program file and choose "Run as administrator".
- BFSU Bilingual Alignment Keeper beta: adds line-end markers to keep sentence alignment after POS tagging.
- BFSU Detagger: an updated version of the DeTagging Tool, originally created by Yunlong Jia. Detagger strips off tags in four different formats: underscore, forward slash, angle brackets, and square brackets from annotated texts.
- BFSU Inter-rater Agreement Gauge: computes inter-rater agreement measures among two or more raters.
- BFSU Logistic Regression Tool: performs logistic regression on user-provided datasets.
- BFSU_Log-likelihood_Calculator_with_ES: compares the frequency counts of one or more linguistic items across two (sub-)corpora.
- BFSU One Hot Encoder: converts categorical values into binary data for logistic regression.
- BFSU Precision Recall Evaluator: computes performance metrics for binary classification models.
- BFSU Spoken Utterances Extractor: extracts spoken utterances from literary works.
- BFSU TAG: enables batch text annotation via API using custom prompts.
- BFSU Texcel: extracts text data from Excel files and saves it as individual text files.
- BFSU Text Merger: merges the selected files into one text file.
- BFSU Text Randomizer 2: extracts random samples from an uploaded text or multiple texts within a folder.
- BFSU Text Segmenter: splits text files into smaller individual files.
- BFSU XML2TXT Converter: transforms (BNC 2014 spoken) XML files into *.txt files.
- RTF2TXT Converter: batch converts *.rtf documents to *.txt files. The tool was developed by Wanbo Ren from the School of Foreign Languages at Northwest University, China.
Tutorial video on developing software using large language models
GitHub BFSUNLP Page: https://github.com/bfsunlp
R and Python scripts and tutorials
-Mixed_effects_logistic_regression (prepared by Dong Zhang 张懂)
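-Word2vec tutorial (prepared by Hailong Deng 邓海龙). For a detailed tutorial, please see Hailong Deng (邓海龙), 2019, "Python词向量训练与应用技术解析" [An analysis of word vector training and application techniques in Python], 《语料库语言学》 [Corpus Linguistics] (2).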
Concordancers and query tools
-BFSU PowerConc 1.0 beta25b. PowerConc video tutorial. User-made video tutorial.
-BFSU CQPweb online concordancer (download the CQPweb tutorial here; download the concise illustrated CQPweb user manual). CQP syntax: guide to advanced queries.
-BFSU ParaConc 1.2.1: A freeware parallel concordancer
-Colligator 2.0: A colligation query and analysis tool (1.4MB)
-SearchSubtitle: A programme for video-based time-aligned subtitle concordancing (Chinese user interface). The tool was designed by Wenzhong Li and programmed by Zhaoyang Han (533KB).
-PatCount 1.0: PatCount is the abbreviated form of 'pattern counting'. It is a query tool for counting the frequency of lexical, syntactic, and discoursal features in texts. The results are displayed and can be exported as 'feature(s) x text(s)' matrices, which are well suited to follow-up advanced (inferential) statistical analyses. Regular expressions are fully supported in the tool. Microsoft .NET Framework must be installed before you run the tool. The tool was designed by Maocheng Liang and Wenxin Xiong and programmed by Wenxin Xiong (3.6MB).
-PatCount 2.0 (wxPatCount): An updated version of PatCount, written by Professor Maocheng Liang.
-BFSU ConcGram Lite: A straightforward and easy-to-use tool for retrieving contiguous and non-contiguous bigrams with directional variations, based on a search for two target words.
Annotation tools
-BFSU Stanford POS Tagger (two tagging models) (13.3MB)
-BFSU Stanford POS Tagger Lite (one tagging model, left3words) (4.1MB)
-BFSU Stanford Parser (4.8MB)
-BFSU Syntactic Complexity Analyzer. The tool does not work on 64-bit operating systems.
-BFSU Qualitative Coder 1.2. The tool assists manual annotation based on a user-defined category template (e.g. a taxonomy of speech acts). A semi-automatic feature is also available when some heuristic knowledge is supplied, e.g. a thesaurus, or recurrent lexico-grammar identified through preliminary human annotation. User-made video tutorial.
-BFSU Qualitative Explorer. The tool enables the counting of annotated features across multiple files.
-YACSI Chinese tokeniser and POS tagger version 0.96 (8.4MB). Version 0.96 is a more stable release, but your system time has to be set to a date earlier than 2012. The tool was written by Liangping Wu.
-KAT tool (or KWIC-based Annotation Tool): The tool enables users to automatically group the patterns of a search term based on a few human-identified patterns. The annotation tool needs a text file (the scheme file) to proceed. This file contains the labels, one per line, that users would like to apply to the patterns identified in the concordance lines.
-DeTagging Tool (414KB): The tool helps to strip off tags of four different formats (namely, underscore, forward slash, angle brackets, and square brackets) in annotated texts.
-MeCab Japanese word tokeniser and POS tagger Windows GUI compiled by Prof. Maocheng Liang.
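To illustrate what stripping the four tag formats named in the DeTagging Tool entry above can involve, here is a minimal Python sketch. It is not the tool's implementation: the function name `strip_tags`, the style keys, and the regular expressions are all assumptions made for illustration, and real annotated texts may need stricter patterns (the slash pattern, for instance, would also damage URLs or dates written with slashes).

```python
import re

# Hypothetical patterns, one per tag format; not taken from the DeTagging Tool.
TAG_PATTERNS = {
    "underscore": r"_[^\s_]+",    # word_TAG
    "slash": r"/[^\s/]+",         # word/TAG
    "angle": r"<[^<>]*>",         # <TAG>word</TAG>
    "square": r"\[[^\[\]]*\]",    # word[TAG]
}

def strip_tags(text, style):
    """Remove all tags of the given style and return the plain text."""
    return re.sub(TAG_PATTERNS[style], "", text)

print(strip_tags("The_AT0 cat_NN1 sat_VVD", "underscore"))
print(strip_tags("<w>The</w> <w>cat</w>", "angle"))
```

Keeping one pattern per format lets the user choose the format explicitly, which avoids guessing whether an underscore or slash in running text is part of a tag.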
Statistical tools for corpus analysis
-Log Likelihood Calculation Excel Spreadsheet (40KB) (see also Paul Rayson's log likelihood calculator)
-Fisher's Exact Test p-value Calculation Excel Spreadsheet (118KB) with a chi-square test score, effect size metrics of relative risk and odds ratio scores.
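For readers who want to see the arithmetic behind a log-likelihood comparison of one item across two corpora, the following is a minimal Python sketch of the two-term form widely used in corpus linguistics. It is not taken from the spreadsheets listed above, and the function and parameter names are assumptions chosen for illustration.

```python
import math

def log_likelihood(freq1, freq2, size1, size2):
    """Log-likelihood (G2) for one item observed in two corpora.

    freq1, freq2: observed frequencies of the item in corpus 1 and corpus 2.
    size1, size2: total token counts of corpus 1 and corpus 2.
    """
    total = size1 + size2
    # Expected frequencies if the item were equally frequent in both corpora.
    expected1 = size1 * (freq1 + freq2) / total
    expected2 = size2 * (freq1 + freq2) / total
    ll = 0.0
    for observed, expected in ((freq1, expected1), (freq2, expected2)):
        if observed > 0:  # a zero count contributes nothing to the sum
            ll += observed * math.log(observed / expected)
    return 2 * ll

print(log_likelihood(120, 60, 100000, 100000))
```

When the relative frequencies are identical in both corpora the score is 0. Larger values indicate a bigger difference, but the score alone does not show which corpus has the higher relative frequency.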
Specialised corpus tools
-BFSU Collocator (835KB) is a search-based collocation extraction tool which yields MI, MI3, T-score, Z-score, Log-Log, and Log likelihood scores of collocational strength. The tool works with raw and CLAWS PoS-tagged English texts, and does not work for texts in Chinese or other languages.
-BFSU English Sentence Segmenter 1.0 (447KB)
-Concordance Randomizer 1.2 (531KB)
-Keywords Plus 1.0 (1.87 MB) (an earlier release of Keywords Plus tool in which the resulting keywords are linked to their original concordance lines. This feature has not been retained in version 2.)
-Keywords Plus 2.0 (5.67 MB) (a free keyword generation tool based on the comparison of two corpora or wordlists. The tool is helpful for creating Chinese and English keyword lists, key Ngram lists and key POSgram lists.)
-Pattern Builder (7.2 MB) is an aid for those who are not familiar with regular expressions in searching PoS-tagged English texts.
-Readability Analyzer 1.0 (1.1MB): A tool which yields readability indices, type/token ratio (TTR), standardised type/token ratio (STTR), lemmatised TTR, lemmatised STTR, average word length, average sentence length, etc.
-Sub-corpus Creator: Sub-corpora can be extracted based on the text strings contained in filenames of texts or in-text metadata markup.
-SRT2TMX converter (a tool for converting bilingual SRT subtitles to TMX format) written by Linwei Yang
-Text Cleaning Library for PowerGREP (5.5KB)
-TextSmith Tools (6.1MB): This tool showcases a methodological innovation of a genre-informed phraseological profile across the discourse segments. TextSmith segments texts by an equal proportion, based on the users’ own intuitive estimation of the sections the imported texts might contain.
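Two of the measures named in the Readability Analyzer entry above, TTR and STTR, can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the tool's own code: the tokeniser is deliberately simplistic, the function names are hypothetical, and the default chunk size of 1000 tokens is a common convention rather than a documented setting of the tool.

```python
import re

def tokenize(text):
    """Lowercase the text and return runs of letters and apostrophes as tokens."""
    return re.findall(r"[a-z']+", text.lower())

def ttr(tokens):
    """Type/token ratio: number of distinct tokens divided by number of tokens."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0

def sttr(tokens, chunk_size=1000):
    """Standardised TTR: mean TTR over consecutive full chunks of chunk_size tokens.

    A trailing partial chunk is discarded; if the text is shorter than one
    chunk, the plain TTR is returned instead.
    """
    starts = range(0, len(tokens) - chunk_size + 1, chunk_size)
    chunks = [tokens[i:i + chunk_size] for i in starts]
    if not chunks:
        return ttr(tokens)
    return sum(ttr(chunk) for chunk in chunks) / len(chunks)

tokens = tokenize("The cat sat on the mat")
print(ttr(tokens), sttr(tokens, chunk_size=3))
```

Averaging over equal-sized chunks makes the ratio comparable across texts of different lengths, whereas plain TTR tends to fall as a text gets longer.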
Data-driven learning tools and resources
-BFSU Sentence Collector is a pedagogically motivated concordancing tool which allows users to refine search results according to sentence length and lexical difficulty. The results are displayed as complete sentences instead of in KWIC mode. To customise your own textual data for text collection, first segment the English texts on your own hard drive with BFSU Sentence Segmenter 1.0, then mark up the unknown/new words based on a base word list with BFSU NewWords Marker 1.0, and save the data as an *.idx file in the index folder of BFSU Sentence Collector.
Useful tools and resources that were not developed by BFSU FLERIC members
-Batch Encoding Converter (Chinese interface) (985KB)
-Chinese character to romanised pinyin conversion tool (Chinese interface)
-CorpusWordParser (Chinese word segmentation tool by Xiao Hang 肖航). Video tutorial.
-Flesh PC: a standalone tool for calculating Flesch Reading Ease, Flesch-Kincaid Grade Level and other descriptive text measures. (4.7MB)
-Duometer is a command-line tool that efficiently identifies near-duplicate pairs of documents in large collections of texts. It works on all platforms with a Java runtime installed.
-Multidimensional Analysis Tagger 1.1 by Andrea Nini (official site here) (9.7MB). Read the MAT v. 1.1 Manual.
-Multidimensional Analysis Tagger 1.2 by Andrea Nini (official site here) (9.96MB). Read the MAT v. 1.2 Manual.
-Multidimensional Analysis Tagger 1.3 by Andrea Nini (official site here) (9.8MB). Read the MAT v. 1.3 Manual.
-PDF2TXT conversion tool (Chinese interface) (2.1MB).
-Qt Readability: The tool provides values for Flesch-Kincaid Grade Level, Flesch Reading Ease, Gunning-Fog Readability Index, SMOG, and Coleman-Liau Readability Index as well as statistics of words, sentences, and syllables used in the five formulae. (5.4MB)
-SDAU ParaConc. A parallel concordancer with which TMX and bitext text files can be searched.
-Superb Batch Renamer (batch renaming tool; freeware with an English interface) (316KB)
-THULAC-GUI: Tsinghua University Chinese text tokeniser
-UTFCastExpress: Batch UTF-8 conversion from ANSI (993KB)
-WCopyfind: A programme for detecting duplicated files or plagiarism (1.9MB)
-Wordaizer: From text to word cloud with a twist (8.5MB)
-WordCreator: With the WordCreator, you can create random words or whole texts from letters, syllables, single words or sentence fragments which can be weighted by probabilities. Many further functions, such as counting of characters, bi- and trigrams or real syllables, are also integrated. Official site. (740KB)
-WordSmith Tools V4.0 is freely downloadable now.
-Writefullclient. This tool provides concordance, near-synonym comparison and other features based on Google Books, Google Scholar, Google News, and Google Web. (32.6 MB)
-Zipfian Curve Excel Spreadsheet by (dzhigner) Zheng Ding of Luoyang Teachers' College (920KB)
-A host of free corpus tools at https://www.corpus-analysis.com