greg7mdp | 3 years ago

Page and Brin's original crawler had the issue, but they were unable to fix it, and it was Jeff Dean and Sanjay Gupta who rewrote the crawler so that it and the index storage would be hardware fault tolerant.


dekhn | 3 years ago

The crawler doesn't interact directly with pagerank. PageRank is computed later and is attached as a per-document value.

I asked Jeff about what it was like in the early days and he told me: when they first joined, the entire crawl to index to serving stack was documented in a README that you would follow, typing commands and waiting for each step to complete. A failure in a step meant completely starting over (for that step) or even earlier, depending on how and where temp data was materialized.

He said he and Sanjay (Ghemawat, not Gupta) then wrote mapreduce as a general-purpose tool for handling multiple steps of the crawl-to-servable-index pipeline. Not only is mapreduce good at restarting (if the map output and the shuffle output are persistent), but its design also naturally lends itself to building an indexing system.

If you go back to the old papers you'll see several technologies mentioned over and over: protocol buffers, recordio, and sstable. Protocol buffers are a serialization format for structured records; recordio is an archive format that stores large numbers of documents in a small number of sharded files; and sstable is a key-sorted version of the same data (or some transformed version of it). So, building an inverted index is trivial: your mapper is passed documents and emits a key/value pair (token, document) for each token in the document. The shuffler automatically handles grouping the keys and sorting the values, which produces a fairly well-organized associative table (in the format of sstables).
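To make the shape of that concrete, here is a minimal in-memory sketch of the map/shuffle pattern described above. This is not Google's implementation (the names and structure here are illustrative assumptions); it just shows how emitting (token, document) pairs and then grouping/sorting by key yields an inverted index, analogous to a small sstable:

```python
from collections import defaultdict

def map_phase(docs):
    """Mapper: emit a (token, doc_id) pair for each token in each document."""
    for doc_id, text in docs.items():
        for token in set(text.lower().split()):
            yield token, doc_id

def shuffle_phase(pairs):
    """Shuffle: group values by key and sort, as MapReduce does between phases."""
    index = defaultdict(list)
    for token, doc_id in pairs:
        index[token].append(doc_id)
    # Key-sorted output -- the same organization an sstable gives you on disk.
    return {token: sorted(ids) for token, ids in sorted(index.items())}

docs = {
    "d1": "the quick brown fox",
    "d2": "the lazy dog",
}
index = shuffle_phase(map_phase(docs))
print(index["the"])  # -> ['d1', 'd2']
```

In a real mapreduce the mapper and shuffler run on many machines over sharded input files, and the sorted output is written as sstables rather than kept in a dict, but the data flow is the same.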

BigTable came about because managing lots of sstables mutably became challenging.

MapReduce was replaced with Flume, which was far more general and easier to work with, and BigTable was replaced with Spanner (ditto), and GFS was replaced with Colossus, but many of the underlying aspects of how things are done at Google in prod are based on what Jeff, Sanjay, and a few others did a long time ago.

Note that mapreduce isn't particularly innovative, except that the scaling aspects were fairly esoteric at the time.