mikaraento | 6 months ago

Around 2008 a core step in search was basically a grep over all documents. The grep was distributed over roughly 1000 machines so that the documents could be held in memory rather than on disk.

Inverted indices were not used as they worked poorly for “an ordered list of words” (as opposed to a bag of words).

And this doesn’t even start to address the ranking part.
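
A rough sketch of what "a grep distributed over machines holding documents in memory" could look like, scaled down to a single process. Everything here is an assumption for illustration: `SHARDS`, `scan_shard`, and `distributed_grep` are hypothetical names, the shards are plain dicts standing in for machines, and threads stand in for the network fan-out. It is not a description of any real system.

```python
from concurrent.futures import ThreadPoolExecutor

def scan_shard(shard, phrase):
    """Scan every document in one in-memory shard for an ordered phrase."""
    words = phrase.lower().split()
    n = len(words)
    hits = []
    for doc_id, text in shard.items():
        tokens = text.lower().split()
        # Ordered match: the words must appear consecutively, in order.
        if any(tokens[i:i + n] == words for i in range(len(tokens) - n + 1)):
            hits.append(doc_id)
    return hits

def distributed_grep(shards, phrase):
    """Fan the scan out to all shards in parallel and merge the results."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partial = pool.map(lambda shard: scan_shard(shard, phrase), shards)
        return sorted(doc_id for hits in partial for doc_id in hits)

# Hypothetical toy corpus: each dict plays the role of one machine's RAM.
SHARDS = [
    {1: "the quick brown fox", 2: "brown quick the fox"},
    {3: "a quick brown dog", 4: "nothing relevant here"},
]

print(distributed_grep(SHARDS, "quick brown"))
```

Document 2 contains both words but not in that order, so a bag-of-words lookup would return it while the ordered scan does not.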

smokel | 6 months ago

It seems highly unlikely that they did not use indices. Scanning all documents would be prohibitively slow. I think it is more likely that the indices were really large, and that it took hundreds to thousands of machines to hold them in RAM. A parallel scan through those indices seems likely.

Wikipedia [1] links to "Jeff Dean's keynote at WSDM 2009" [2] which suggests that indices were most certainly used.

Then again, I am no expert in this field, so if you could share more details, I'd love to hear more about it.

[1] https://en.wikipedia.org/wiki/Google_data_centers

[2] https://static.googleusercontent.com/media/research.google.c...

bruckie | 6 months ago

I worked on search at Google around that timeframe, and it definitely used an index. As far as I know, it has from the very beginning.

You can solve the ordered list of words problem in ways that are more efficient than grepping over the entire internet (e.g. bigrams, storing position information in the index).
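
To make the two techniques named above concrete, here is a minimal sketch of a positional index and a bigram index. All names (`build_positional_index`, `phrase_search`, `build_bigram_index`, `DOCS`) are hypothetical and the tokenizer is a bare whitespace split; this illustrates the general idea only and is not a claim about how any production search engine implements it.

```python
from collections import defaultdict

def build_positional_index(docs):
    """Map word -> {doc_id: [positions]} so word order can be checked later."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, word in enumerate(text.lower().split()):
            index[word][doc_id].append(pos)
    return index

def phrase_search(index, phrase):
    """Return sorted doc ids where the phrase's words occur consecutively."""
    words = phrase.lower().split()
    if not words or any(w not in index for w in words):
        return []
    # Step 1: bag-of-words intersection narrows the candidate documents.
    candidates = set(index[words[0]])
    for w in words[1:]:
        candidates &= set(index[w])
    # Step 2: positions confirm the words are adjacent and in order.
    hits = []
    for doc_id in candidates:
        later = [set(index[w][doc_id]) for w in words[1:]]
        if any(all(start + k + 1 in later[k] for k in range(len(later)))
               for start in index[words[0]][doc_id]):
            hits.append(doc_id)
    return sorted(hits)

def build_bigram_index(docs):
    """Map each adjacent word pair -> set of doc ids containing that pair."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        tokens = text.lower().split()
        for pair in zip(tokens, tokens[1:]):
            index[pair].add(doc_id)
    return index

# Hypothetical toy corpus.
DOCS = {
    1: "the quick brown fox",
    2: "brown quick the fox",
    3: "a quick brown dog",
}

positional = build_positional_index(DOCS)
bigrams = build_bigram_index(DOCS)
print(phrase_search(positional, "quick brown"))
print(sorted(bigrams[("quick", "brown")]))
```

The positional index handles phrases of any length at the cost of storing a position per occurrence; the bigram index answers two-word phrases with a single lookup but grows with the number of distinct adjacent pairs.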