I loved the inter-lingual web page linkage visualization project. Any idea why Traitor won the contest? It seems very similar to the standard "build an inverted index with MapReduce" problem, or am I missing something?
Common Crawl is awesome. I wonder how complex it would be to run a Google-like frontend on top of it, and how good the results would be after a couple days of hacking...
Very, and probably not very good (compare Gigablast to Google for an example of why it's hard). Not to take anything away from Common Crawl, but crawling is often one of the easier parts of building a search engine. A crawler can be as simple as
    for (url in listofurls) {
        geturl(url);
        add extracted urls to listofurls;
    }
Doing it on a large scale, over and over, is a harder problem (which Common Crawl handles for you), but it's not too difficult until you hit scale or want real-time crawling.
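A minimal, runnable version of that loop might look like this in Python. This is a toy sketch only: the fetch function is injected so you can plug in urllib or anything else, and it has none of the politeness delays, robots.txt handling, or frontier prioritization a real crawler needs.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collect absolute URLs from the href attributes of <a> tags."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(urljoin(self.base_url, value))

def crawl(seeds, fetch, max_pages=100):
    """fetch(url) -> HTML string. Returns the set of URLs visited."""
    frontier, seen = list(seeds), set()
    while frontier and len(seen) < max_pages:
        url = frontier.pop(0)          # breadth-first: take the oldest URL
        if url in seen:
            continue
        seen.add(url)
        parser = LinkExtractor(url)
        parser.feed(fetch(url))        # extract links from the fetched page
        frontier.extend(u for u in parser.links if u not in seen)
    return seen
```

Because `fetch` is a parameter, you can test the loop against an in-memory "web" before pointing it at the network.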
Building an index on 210 TB of data, however... Assuming you use Sphinx/Solr/Gigablast, you are going to need about 50 machines to handle this amount of data with any sort of redundancy. That's just to hold a basic index, not including "PageRank" or anything similar (Gigablast is a web engine, so it may have that built in; I'm not sure). That also doesn't factor in the rankers needed to make it a web search engine, spam/porn detection, and all of the other pieces that go with it. Then you get into serving results: unless your indexes are in RAM, you are going to have a pretty slow search engine, so add a lot more machines to hold the index for common terms in memory.
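For a rough feel of where a "~50 machines" figure could come from, here is a back-of-envelope sketch. Every number below other than the 210 TB from the comment is a hypothetical assumption of mine, not something measured from Common Crawl or any of those engines:

```python
raw_tb = 210         # crawl size mentioned above
index_ratio = 0.25   # assume the inverted index is ~25% of the raw data
replication = 2      # assume two copies of everything for redundancy
per_machine_tb = 2   # assume ~2 TB of fast storage per index server

index_tb = raw_tb * index_ratio * replication  # 105 TB of index to store
machines = index_tb / per_machine_tb           # ~52 machines
```

Shift any of those assumptions and the machine count moves with it, and serving hot terms from RAM adds more machines on top.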
If someone is keen to try this, however, here is a list of articles/blogs that should get you started (I originally wrote this as an HN comment that got a lot of attention, so I turned it into a blog post): http://www.boyter.org/2013/01/want-to-write-a-search-engine-...
The massive advantages that Google has include over a decade of data on the pages that people actually visited in response to a specific query as well as having an in-memory index of the public web, parts of which are updated on the order of seconds to minutes.
I wonder if there is a viable business in maintaining an in-memory & up-to-date index of the public web & selling access to it, with a pricing model that scales according to the amount of computation you are doing on it.
It would be challenging. You've got a crawl, but one with a fair bit of spam in it, despite the donation of blekko metadata. Then you have to figure out ranking for keywords, something that the blekko metadata won't help you with at all.
Limited resources are the only reason. We are working on a subset crawl of ~3 million pages that will be published weekly starting two weeks from now. But doing the full crawl takes a lot of time, effort and money.
Good crawlers should typically avoid Wikipedia links, to limit the number of HTTP requests hitting Wikipedia's servers (and keep their costs down), especially because Wikipedia makes full database dumps available for download through a separate, cheaper channel: http://en.wikipedia.org/wiki/Wikipedia:Database_download
Yes! Startups, commercial companies, etc. can all use the data for free. The terms of use basically say: don't do anything illegal with it, plus a few other things, but it shouldn't affect the vast majority of uses.
Actually, a video on a startup that uses Common Crawl data is getting posted tomorrow.
From the FAQ: "Please refer to the Common Crawl Terms of Use document for a detailed, authoritative description of our Terms of Use guidelines, but, in general, you cannot republish the data retrieved from the crawl (unless allowed by fair use), you cannot resell access to the service, you cannot use the crawl data for any illegal purposes, and you must respect the Terms of Use of the sites we crawl."
Just replace 's3://' with 'https://s3.amazonaws.com/'. You can use this link [1], but it looks like most of them return "Access Denied", so you would likely need to authenticate with your AWS credentials to access them.
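The substitution described above is mechanical, so it can be captured in a tiny helper (the example key below is hypothetical, purely for illustration):

```python
def s3_to_https(s3_url):
    """Rewrite an s3://bucket/key URL into its path-style HTTPS form."""
    prefix = "s3://"
    if not s3_url.startswith(prefix):
        raise ValueError("expected an s3:// URL")
    return "https://s3.amazonaws.com/" + s3_url[len(prefix):]

# e.g. s3_to_https("s3://aws-publicdatasets/some/key")
#      -> "https://s3.amazonaws.com/aws-publicdatasets/some/key"
```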
You need an Amazon account - though the data is available for free, I think you need to specify your access key to actually fetch it.
From there you can grab the S3 command-line tools (http://s3tools.org/s3cmd), or load it up from Hadoop, or through one of the various open source libraries (boto, for instance).
Table 2a purports to show the frequency of second-level domains (SLDs):
     Rank  SLD                         Pages  Fraction
        1  youtube.com            95,866,041    0.0250
        2  blogspot.com           45,738,134    0.0119
        3  tumblr.com             30,135,714    0.0079
        4  flickr.com              9,942,237    0.0026
        5  amazon.com              6,470,283    0.0017
        6  google.com              2,782,762    0.0007
        7  thefreedictionary.com   2,183,753    0.0006
        8  tripod.com              1,874,452    0.0005
        9  hotels.com              1,733,778    0.0005
       10  flightaware.com         1,280,875    0.0003
If I'm reading this correctly, it seems the crawler managed to hit a huge number of YouTube video pages... but only a fraction of them. I couldn't find a total count of YouTube videos, but YouTube's own stats page says 200 million videos alone have been tagged with Content ID (identified as belonging to movie/TV studios).
In any case, it's surprising not to see Wikipedia on there. English Wikipedia has 4+ million articles, so it should rank ahead of thefreedictionary.com.
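As a quick sanity check on the table above: each fraction should just be that domain's page count divided by the total number of crawled URLs, so you can back out the implied total from any row and confirm the others agree to within rounding.

```python
youtube_pages = 95_866_041
youtube_fraction = 0.0250

# Implied total crawl size, from the top row of the table.
implied_total = youtube_pages / youtube_fraction  # roughly 3.8 billion URLs

# The other rows should agree to within rounding, e.g. blogspot.com:
blogspot_estimate = 45_738_134 / implied_total    # ~0.0119, matching the table
```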
You also need to comply with the Common Crawl TOU: http://commoncrawl.org/about/terms-of-use/
[1] https://s3.amazonaws.com/aws-publicdatasets/