
The English ENCOW14 web corpus now available

Published: 2019-09-24

The English ENCOW14 web corpus is now available in its first release version, ENCOW14A (16.8 GT full corpus, 9.6 GT shuffled). The shuffled version is completely free but available only to people working in academia.

At the same time, we are making available our new Colibri² web application, hosted at webcorpora.org. It allows registered users to query the corpora or download the whole data sets. Colibri² also serves DECOW12AX (German, 8.3 GT), NLCOW14AX (Dutch, 4.7 GT), and SVCOW14AX (Swedish, 4.8 GT).

ENCOW14A was crawled in 2012 and 2014 in over 20 top-level domains and has undergone state-of-the-art deduplication, boilerplate removal, hyphenation repair, and repair of run-together sentences (texrex). It is annotated with POS tags (Penn/TreeTagger), lemmas (TreeTagger), chunks (TreeTagger), and dependency relations (MaltParser, experimental). It contains the following metadata: URL, Last-Modified date, crawl date, country and city geolocation, a document quality score, and paragraph boilerplate scores.

Download & web access via Colibri² (free registration required):

https://webcorpora.org/

Corpus information:

http://corporafromtheweb.org/encow14/

COW is created at Freie Universität Berlin, German Grammar Group:

http://hpsg.fu-berlin.de/

All processing specific to web documents was done with texrex:

http://texrex.sourceforge.net/

ENCOW14 includes GeoLite data created by MaxMind, available from:

http://www.maxmind.com.

Best regards,

Roland Schäfer (ENCOW14/COW), Felix Bildhauer (COW)