Common Crawl Web Data

Published: 2019-09-24

The Common Crawl corpus contains petabytes of data collected over the last 7 years. It includes raw web page data, extracted metadata, and extracted text.

Data Location

The Common Crawl dataset lives on Amazon S3 as part of the Amazon Public Datasets program. From the Public Datasets, you can download the files free of charge using HTTP or S3.
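
As a rough illustration of the two access routes mentioned above, the following Python sketch builds both an HTTP URL and an S3 URI for the same file. The bucket name (commoncrawl), the HTTP endpoint (data.commoncrawl.org), the crawl ID (CC-MAIN-2019-39) and the helper names are assumptions that do not come from this page; check them against http://commoncrawl.org/ before relying on them.

```python
import gzip
import urllib.request

# Assumptions (not stated on this page): bucket name, HTTP endpoint and
# crawl ID are illustrative; verify them on commoncrawl.org.
BUCKET = "commoncrawl"
HTTP_BASE = "https://data.commoncrawl.org/"
CRAWL_ID = "CC-MAIN-2019-39"


def paths_key(crawl_id, kind="warc"):
    """Return the assumed object key of the gzip-compressed file listing."""
    return "crawl-data/{}/{}.paths.gz".format(crawl_id, kind)


def http_url(key):
    """Build the HTTP download URL for an object key."""
    return HTTP_BASE + key.lstrip("/")


def s3_uri(key):
    """Build the S3 URI for the same object key."""
    return "s3://{}/{}".format(BUCKET, key.lstrip("/"))


def fetch_paths(crawl_id, kind="warc"):
    """Download the listing over HTTP and return one object key per line."""
    with urllib.request.urlopen(http_url(paths_key(crawl_id, kind))) as resp:
        data = gzip.decompress(resp.read())
    return data.decode("utf-8").splitlines()


if __name__ == "__main__":
    key = paths_key(CRAWL_ID)
    print(http_url(key))  # address for a plain HTTP download
    print(s3_uri(key))    # address for S3 tools
```

Only fetch_paths touches the network; the other helpers just format strings, so the same object key can be handed either to an HTTP client or to an S3 tool.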

 

http://commoncrawl.org/