
CommonCrawl: an open repository of web crawl data that is universally accessible

92 points | abhishektwr | 14 years ago | commoncrawl.org | reply

8 comments

[+] abhishektwr|14 years ago|reply

Just a pointer: the code for the CommonCrawl project is available on GitHub: https://github.com/commoncrawl/commoncrawl

[+] pooyak|14 years ago|reply

Here's the HN thread from when Common Crawl was announced; there's interesting info there: http://news.ycombinator.com/item?id=3209690

[+] fungi|14 years ago|reply

If you're into such things, then maybe http://yacy.net/ (p2p crawler and search) will be useful to you as well.

[+] Titanous|14 years ago|reply

The latest data available is from 2010-09-25, which seems to be too old to be useful for most things.

[+] rgrieselhuber|14 years ago|reply

It would be great to hear more about the tools they are using to crawl, and to see them potentially open the crawl up to more people who want to contribute computing resources.

[+] unknown|14 years ago|reply

[deleted]

[+] emilis_info|14 years ago|reply

This one may also be interesting for open data devs: http://scraperwiki.com/

[+] Aloisius|14 years ago|reply

I hear a lot of people are crunching on CommonCrawl's data. It'll be interesting to see the type of stuff people come up with!

[+] nithinag|14 years ago|reply

This looks really nice!