Google Algorithm Leaked

90 points | certifiedloud | 1 year ago | seroundtable.com

10 comments

advisedwang|1 year ago

It's not clear to me whether the leak is actually for Google Search or for one of the products around search that isn't "Search", like Document Warehouse [1]. Is there anything definitive one way or the other in all this? Nobody seems to even be questioning this.

[1] https://cloud.google.com/document-warehouse/docs/overview

9dev|1 year ago

If you read the original publication on this[1], they mention there’s a stray commit publishing the internal variant of the SDK intended for the actual Google warehouse database. So the code bases probably live close enough together for someone to accidentally pass the wrong folder name or something.

This has been fixed, but the commit and all its changes are out there, and tragically, they were published alongside a copy of the Apache 2.0 license (intended for the document warehouse API SDK), which officially sanctions freely copying and using the code. So there is really nothing Google can do about it.

[1] https://ipullrank.com/google-algo-leak

avallach|1 year ago

The post title is misleading. The algorithm did not leak, only the documentation listing all the signals that can possibly be used as inputs for that algorithm. It doesn't reveal which ones are actually used and how.

atonse|1 year ago

This looks like it's written in Elixir (the docs are generated with ExDoc, Elixir's documentation tool).

This can't possibly be the actual search index rules (which are probably decades-old code, my guess is in either Python or Java), unless they rewrote all of it in the past few years?

Can anyone else confirm this?

9dev|1 year ago

It’s not. Google uses a content warehouse database internally that holds all stored web page content, and to access this vast database, they have an API. The code discovered here is a generated SDK for Elixir for this content warehouse API.

Apparently, Google had a now-deprecated product (who would have guessed that? Consider me shocked!) that provided customers with a trimmed-down version of this database for their own purposes, but mistakenly published to GitHub the internal SDK code instead of the version intended for Google Cloud customers.

So while this doesn’t directly show the search index source code, it describes the data schema of the index in great detail, which allows at least some interesting educated guesses about the workings of the actual index.