It's not clear to me whether the leak is actually for Google Search or one of the products around search that isn't "Search", like Document Warehouse [1]. Is there anything definitive one way or the other in all this? Nobody even seems to be questioning this.
If you read the original publication on this[1], they mention there’s a stray commit publishing the internal variant of the SDK intended for the actual Google warehouse database. So the code bases probably live close enough together for someone to accidentally pass the wrong folder name or something.
This has been fixed, but the commit and all its changes are out there, and tragically they were published alongside a copy of the Apache 2.0 license (intended for the document warehouse API SDK), which officially sanctioned freely copying and using the code. So there is really nothing Google can do about it.
The post title is misleading. The algorithm did not leak, only the documentation listing all the signals that can possibly be used as inputs for that algorithm. It doesn't reveal which ones are actually used and how.
This looks like it's written in Elixir (the docs are using ExDoc, Elixir's documentation tool).
This can't possibly be the actual search index rules (which is probably code that's decades old, my guess is either in Python or Java?) – unless they rewrote all of it in the past few years?
It’s not. Google uses a content warehouse database internally that holds all stored web page content, and to access this vast database, they have an API. The code discovered here is a generated SDK for Elixir for this content warehouse API.
Apparently, Google had a now-deprecated product (who would have guessed that? Consider me shocked!) that provided customers with a trimmed-down version of this database for their own purposes, but mistakenly published the internal SDK code to GitHub instead of the code intended for Google Cloud customers.
So while this doesn’t directly show the search index source code, it describes the data schema of the index in great detail, so there are at least some interesting educated guesses on the workings of the actual index to draw from it.
advisedwang|1 year ago
[1] https://cloud.google.com/document-warehouse/docs/overview
9dev|1 year ago
[1] https://ipullrank.com/google-algo-leak
avallach|1 year ago
atonse|1 year ago
Can anyone else confirm this?
9dev|1 year ago
ChrisArchitect|1 year ago
Some more discussion: https://news.ycombinator.com/item?id=40496967