The Common Crawl corpus contains petabytes of data collected over the last 7 years. It includes raw web page data, extracted metadata, and extracted text.
Data Location
The Common Crawl dataset lives on Amazon S3 as part of the Amazon Public Datasets program. From there, you can download the files free of charge using HTTP or S3.
http://commoncrawl.org/
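
As a rough illustration of the two access routes mentioned above (HTTP and S3), the sketch below builds both kinds of address for a single file. The bucket name `commoncrawl`, the HTTP endpoint `https://data.commoncrawl.org`, and the example key are assumptions that do not appear in the text above, so check them against the Common Crawl site before relying on them. The helper names are hypothetical.

```python
import urllib.request
from urllib.parse import quote

# Assumptions (not stated in the description above): the bucket name and
# the HTTP endpoint that fronts it. Verify both on commoncrawl.org.
BUCKET = "commoncrawl"
HTTP_BASE = "https://data.commoncrawl.org"


def s3_uri(key: str) -> str:
    """Return the S3 address for a file key inside the bucket."""
    return f"s3://{BUCKET}/{key.lstrip('/')}"


def http_url(key: str) -> str:
    """Return the HTTP address for the same file key."""
    # quote() percent-encodes unsafe characters but leaves '/' intact.
    return f"{HTTP_BASE}/{quote(key.lstrip('/'))}"


def download(key: str, dest: str) -> None:
    """Fetch one file over HTTP and save it to a local path."""
    urllib.request.urlretrieve(http_url(key), dest)


if __name__ == "__main__":
    # Hypothetical key, for illustration only; real keys are listed in the
    # path files published for each crawl.
    key = "crawl-data/EXAMPLE-CRAWL/warc.paths.gz"
    print(s3_uri(key))
    print(http_url(key))
```

Calling `download(key, "warc.paths.gz")` with a real key would save that file locally over HTTP. The S3 address is meant for S3-aware tools, which may require AWS credentials.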