
Token definitions in regex for multiple languages

Published: 2021-07-16

English word count (Regular expression for English words)

[A-Za-z0-9-]+
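As a rough illustration of how a token definition like this turns into a word count, here is a minimal Python sketch. The helper name `count_tokens` and the sample sentence are assumptions made for the example and are not part of the original page.

```python
import re

# Token definition for English words, copied from the line above.
ENGLISH_TOKEN = re.compile(r"[A-Za-z0-9-]+")

def count_tokens(text: str) -> int:
    """Return how many substrings of `text` match the English token pattern."""
    return len(ENGLISH_TOKEN.findall(text))

sample = "State-of-the-art parsers handled 3 texts in 2021."
print(ENGLISH_TOKEN.findall(sample))
# ['State-of-the-art', 'parsers', 'handled', '3', 'texts', 'in', '2021']
print(count_tokens(sample))  # 7
```

Because the hyphen is inside the character class, a hyphenated compound counts as one word. The apostrophe is not, so a form such as "don't" would be counted as two tokens under this definition.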

Chinese character count (Regular expression for Chinese characters)

[\u4e00-\u9fa5]|[A-Za-zＡ-Ｚａ-ｚ0-9０-９\.%％]+

or

[一-龥]|[A-Za-zＡ-Ｚａ-ｚ0-9０-９\.%％]+
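The two alternatives in this pattern behave differently: each Chinese character is matched on its own, while a run of Latin letters, digits, dots and percent signs is matched as a single unit. The sketch below shows this in Python, using the `\u` form of the pattern with the second alternative reduced to half-width characters for brevity. The sample sentence is an illustrative assumption.

```python
import re

# First alternative: exactly one Chinese character per match.
# Second alternative: a whole run of letters, digits, "." and "%" per match.
CHINESE_CHAR = re.compile(r"[\u4e00-\u9fa5]|[A-Za-z0-9\.%]+")

sample = "我在2021年用Python处理了95.5%的文本。"
units = CHINESE_CHAR.findall(sample)
print(units)
# ['我', '在', '2021', '年', '用', 'Python', '处', '理', '了', '95.5%', '的', '文', '本']
print(len(units))  # 13
```

The full stop "。" is outside both alternatives, so punctuation does not add to the count.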

Chinese word count (Regular expression for tokenised Chinese words)

[\u4e00-\u9fa5A-ZＡ-Ｚa-zａ-ｚ0-9０-９\.%％]+

or

[一-龥A-ZＡ-Ｚa-zａ-ｚ0-9０-９\.%％]+
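This pattern only yields a word count when the text has already been segmented, because Chinese characters, letters and digits all sit in one character class and any unbroken run is a single match. A small Python sketch of the difference follows. It uses a half-width-only form of the pattern, and the sample sentences and their segmentation are assumptions for illustration.

```python
import re

# One class for everything: a match ends only at a space or punctuation mark.
CHINESE_WORD = re.compile(r"[\u4e00-\u9fa5A-Za-z0-9\.%]+")

segmented = "我们 在 2021年 使用 Python 处理 了 95.5% 的 文本 。"
unsegmented = "我们在2021年使用Python处理了95.5%的文本。"

print(CHINESE_WORD.findall(segmented))
# ['我们', '在', '2021年', '使用', 'Python', '处理', '了', '95.5%', '的', '文本']
print(len(CHINESE_WORD.findall(segmented)))    # 10
print(len(CHINESE_WORD.findall(unsegmented)))  # 1
```

On unsegmented text the whole sentence collapses into one token, which is why the character-level pattern is the one to use when no word segmentation has been applied.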

Japanese word count (Regular expression for tokenised Japanese words)

[A-Za-zａ-ｚＡ-Ｚ0-9０-９一-龠ぁ-んァ-ヴー〇々%％]+

German word count (Regular expression for German words)

[A-Za-zÄäÖöÜüß0-9-]+
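To see why the accented letters need to be listed explicitly, the following Python sketch compares the German pattern with the plain English one on the same sentence. The `tokenize` helper and the sample text are assumptions for illustration. The NFC normalisation step is an extra precaution not mentioned on the original page: it converts a base letter plus combining diaeresis into the single precomposed character that the pattern lists.

```python
import re
import unicodedata

GERMAN_TOKEN = re.compile(r"[A-Za-zÄäÖöÜüß0-9-]+")
ENGLISH_TOKEN = re.compile(r"[A-Za-z0-9-]+")

def tokenize(pattern: re.Pattern, text: str) -> list:
    """Normalise to NFC, then return every match of `pattern`."""
    return pattern.findall(unicodedata.normalize("NFC", text))

sample = "Die Größe des Mädchens überrascht E-Mail-Nutzer"
print(tokenize(GERMAN_TOKEN, sample))
# ['Die', 'Größe', 'des', 'Mädchens', 'überrascht', 'E-Mail-Nutzer']
print(tokenize(ENGLISH_TOKEN, sample))
# ['Die', 'Gr', 'e', 'des', 'M', 'dchens', 'berrascht', 'E-Mail-Nutzer']
```

With the English pattern every umlaut or ß breaks a word apart, so the count rises from 6 to 8 and the tokens themselves are wrong. The same reasoning applies to the other Latin-script patterns listed here.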

Spanish word count (Regular expression for Spanish words)

[A-Za-záéíñóúÁÉÍÑÓÚüÜ0-9-]+

Icelandic word count (Regular expression for Icelandic words)

[A-Za-z0-9ÁáÐðÉéÍíÓóÚúÝýÞþÆæÖö-]+

Albanian word count (Regular expression for Albanian words)

[a-zA-Z0-9ÇËçë-]+

Catalan word count (Regular expression for Catalan words)

[a-zA-Z·ÀÇÉÈÍÏÒÓÜÚàçéèíïòóüú0-9%'-]+

Finnish word count (Regular expression for Finnish words)

[a-zA-Z0-9åäöšüžøõãíóúýèìòùâêîôûïÿ-]+

Swahili word count (Regular expression for Swahili words)

[a-zA-Z0-9'-]+

Danish word count (Regular expression for Danish words)

[a-zA-ZæÆøØåÅéÉ0-9-]+

Amharic word count (Regular expression for Amharic words)

[\u1200-\u135c\u1369-\u137f0-9a-zA-Z]+
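The two `\u` ranges in this pattern leave a gap between U+135D and U+1368, which is where the Ethiopic punctuation marks sit, so those marks act as token boundaries. The second range starts at the Ethiopic digits. A minimal Python sketch, with a sample string written as escapes and chosen purely for illustration:

```python
import re

AMHARIC_TOKEN = re.compile(r"[\u1200-\u135c\u1369-\u137f0-9a-zA-Z]+")

# Illustrative sample: two runs of Ethiopic syllables joined by the
# Ethiopic wordspace (U+1361) and closed by the full stop (U+1362).
sample = "\u1230\u120b\u121d\u1361\u12d3\u1208\u121d\u1362"

tokens = AMHARIC_TOKEN.findall(sample)
print(len(tokens))  # 2

# U+1369 (Ethiopic digit one) falls inside the second range.
print(AMHARIC_TOKEN.fullmatch("\u1369") is not None)  # True
```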

Malagasy word count (Regular expression for Malagasy words)

[a-zA-Z0-9àÀôÔỳỲâéèêëìîïñÂÉÈÊËÌÎÏÑ'-]+

Dutch word count (Regular expression for Dutch words)

[A-Za-zÁáÀàÂâÄäÉéÈèÊêËëÍíÌìÎîÏïĲĳÓóÒòÔôÖöÚúÙùÛûÜüÝýŸÿƒ0-9'-]+

Bosnian, Croatian, Serbian and Montenegrin word count (Regular expression for Bosnian, Croatian, Serbian & Montenegrin words)

[a-zA-Z0-9ČčĆćĐđǄǅǆǇǈǉǊǋǌŠšŽž-]+

Zulu word count (Regular expression for Zulu words)

[A-Za-z0-9'-]+


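Note: The expressions above assume case-sensitive matching, i.e. the "case sensitive" option in the software settings is checked. If "case sensitive" is not checked, A-Za-z can be written as just a-z or A-Z.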
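The note refers to a setting in the concordancing software. As a loose analogy only, the Python sketch below treats `re.IGNORECASE` as the counterpart of leaving "case sensitive" unchecked, which is an assumption for illustration and not something the original page states.

```python
import re

sample = "The NLP Toolkit parses UTF-8 text"

# Case-sensitive matching: both letter ranges must be listed.
case_sensitive = re.compile(r"[A-Za-z0-9-]+")
# Case-insensitive matching: one letter range is enough.
case_insensitive = re.compile(r"[a-z0-9-]+", re.IGNORECASE)
# One range without the flag silently drops the capital letters.
lowercase_only = re.compile(r"[a-z0-9-]+")

print(case_sensitive.findall(sample))
# ['The', 'NLP', 'Toolkit', 'parses', 'UTF-8', 'text']
print(case_insensitive.findall(sample) == case_sensitive.findall(sample))  # True
print(lowercase_only.findall(sample))
# ['he', 'oolkit', 'parses', '-8', 'text']
```

In Python specifically, `[a-z]` combined with `re.IGNORECASE` on `str` patterns also matches a few non-ASCII letters such as "ſ" (U+017F). Adding `re.ASCII` avoids that for a pattern that lists only ASCII characters.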


Last modified by Jiajin Xu on 24 Jan, 2022.