4.8 billion token Swedish web corpus available (SVCOW14)

Published: 2019-09-24

As the culmination of more than two years of work on the next-generation
COW web corpora, a series of giga-token COWs in Dutch, English, French,
German, Spanish, and Swedish is now leaving the processing tool chain. The
Swedish corpus is the first to become available. It is a 4.8 billion
token sentence-shuffled corpus derived from an unshuffled 8.6 billion
token corpus. Next in line are (in this order) Dutch, English, and German.

Website:       http://hpsg.fu-berlin.de/cow/
Download:      http://hpsg.fu-berlin.de/cow/download/
Web interface: http://hpsg.fu-berlin.de/cow/colibri/

SVCOW14AX maintainer: Roland Schäfer <mail@rolandschaefer.net>

COW initiative 2011-2014: Felix Bildhauer, Roland Schäfer

Best regards,

Roland

===== SUMMARY OF SVCOW14AX CORPUS PROPERTIES =====

* freely available under a restrictive academic license
* crawled in 2012 and 2014 in the TLDs .se and .fi
* vertical format with token/POS/lemma columns in minimal XML
* ready for encoding in versions of CWB which have UTF-8 support
* processed with texrex (http://texrex.sourceforge.net/) for:
  + markup stripping
  + UTF-8 transcoding and checking
  + entity conversion
  + heuristic repairs of broken encodings
  + document quality assessment using frequencies of short words:
    Schäfer et al. (2013) [http://bit.ly/VSmK6M]
  + boilerplate status classification for text blocks:
    Schäfer (2014, draft) [http://bit.ly/VSmK6M]
  + document de-duplication using classic w-shingling:
    Schäfer & Bildhauer (2012) [http://bit.ly/1zJIqiT]
* run-together sentences fixed with rofl (included in texrex)
* hard-coded hyphenation removed with HyDRA (included in texrex)
* tokenization with ucto and custom scripts
* POS tagging with HunPos
* lemmatization with custom tools
* metadata encoded in the released version:
  + document ID
  + document URL
  + server geolocation from GeoLite by MaxMind (http://www.maxmind.com)
  + document quality score
  + boilerplate score
  + crawl date
  + last-modified (if available)
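
For readers who have not worked with vertical-format corpora, the following is a minimal Python sketch of how such a file could be read. It relies only on what the summary above states: one token per line with tab-separated token, POS, and lemma columns, wrapped in minimal XML. Everything else is an assumption made for illustration: the element names `<doc>` and `<s>`, the attribute names `id` and `url`, and the sample tokens and tags are hypothetical and may not match the released files.

```python
import re
from typing import Dict, Iterable, Iterator, List, NamedTuple, Tuple


class Token(NamedTuple):
    form: str
    pos: str
    lemma: str


# Matches key="value" pairs inside an opening tag.
ATTR_RE = re.compile(r'(\w+)="([^"]*)"')


def read_vertical(lines: Iterable[str]) -> Iterator[Tuple[Dict[str, str], List[Token]]]:
    """Yield (document attributes, tokens) for each <doc> element.

    Assumes hypothetical <doc ...> wrappers; any other tag line
    (for example <s> or </s>) is skipped.
    """
    attrs = None
    tokens: List[Token] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if "\t" in line:
            # Token line: token, POS, lemma (column order as in the summary).
            cols = line.split("\t")
            if attrs is not None and len(cols) >= 3:
                tokens.append(Token(cols[0], cols[1], cols[2]))
        elif line.startswith("<doc"):
            attrs, tokens = dict(ATTR_RE.findall(line)), []
        elif line.startswith("</doc"):
            if attrs is not None:
                yield attrs, tokens
            attrs = None
        # Any other line without tabs (structural tags, blanks) is ignored.


# Made-up sample data, not taken from the corpus.
sample = [
    '<doc id="d1" url="http://example.se/">\n',
    "<s>\n",
    "Det\tPN\tdet\n",
    "regnar\tVB\tregna\n",
    ".\tMAD\t.\n",
    "</s>\n",
    "</doc>\n",
]

docs = list(read_vertical(sample))
for doc_attrs, doc_tokens in docs:
    print(doc_attrs["id"], [t.lemma for t in doc_tokens])
```

In practice the file would be streamed from disk, for example with `open(path, encoding="utf-8")`, rather than held in a list, since the corpus has several billion tokens.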
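
The de-duplication step listed above uses classic w-shingling. As a rough illustration of the general technique only (not of texrex's implementation, whose shingle size, hashing, and thresholds are not given here), the sketch below compares two token sequences by the Jaccard coefficient of their sets of contiguous w-token sequences. The function names, the default `w=5`, and the example sentences are assumptions chosen for this example.

```python
from typing import Sequence, Set, Tuple


def shingles(tokens: Sequence[str], w: int = 5) -> Set[Tuple[str, ...]]:
    """Return the set of all contiguous w-token sequences (shingles)."""
    if not tokens:
        return set()
    if len(tokens) < w:
        # Texts shorter than w yield a single shingle covering the whole text.
        return {tuple(tokens)}
    return {tuple(tokens[i:i + w]) for i in range(len(tokens) - w + 1)}


def resemblance(a: Sequence[str], b: Sequence[str], w: int = 5) -> float:
    """Jaccard coefficient of the two shingle sets, between 0.0 and 1.0."""
    sa, sb = shingles(a, w), shingles(b, w)
    if not sa and not sb:
        return 1.0  # two empty texts are treated as identical
    return len(sa & sb) / len(sa | sb)


# Made-up example: two five-token sentences that differ in the last token.
doc_a = "det regnar mycket i dag".split()
doc_b = "det regnar mycket i natt".split()
score = resemblance(doc_a, doc_b, w=2)
print(score)
```

With `w=2`, each sentence has four shingles, three of which are shared, so the score is 3/5. At web-corpus scale, comparing exact shingle sets for all document pairs is too expensive; implementations typically hash the shingles and compare small samples of the hashes instead.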