Anonymous Source Shared Leaked Google Search API Documents

333 points | andrewfong | 1 year ago | sparktoro.com

296 comments


precompute|1 year ago

This just proves all the "suspicions" privacy-conscious users have had about large corporations fingerprinting users, often in very obvious ways. There's often no better place to find ideas for surveillance than the people conscious about being surveilled.

p3rls|1 year ago

Many of the SEO suspicions were confirmed too.

I found it VERY amusing that if you went to r/SEO just yesterday, there were moderators and flaired users (you know, the elites of the SEO community, lol) insisting much of this was "debunked" years ago.

They of course deleted their posts, but the threads are still up. What a den of scammers over there.

https://www.reddit.com/r/SEO/comments/1d1eqjj/comment/l5tvfw...

https://www.reddit.com/user/WebLinkr/

I love how Reddit is turning into the new SEO scam overnight because of this stuff. Great work as always, Danny Sullivan!

theolivenbaum|1 year ago

Seems like a lot of it came from them inadvertently posting some internal API to GitHub: https://github.com/googleapis/elixir-google-api/commit/078b4...

renegade-otter|1 year ago

I guess too many people got laid off to do the whole "three reviewers per PR" thing!

dontdoxxme|1 year ago

And it's Apache licensed, which grants a patent license. Some of the comments refer to specific aspects of how PageRank is calculated. PageRank itself is past patent protection, but I wonder if this might also accidentally grant licenses to other patents.

ec109685|1 year ago

Oops, someone’s script was too greedy when uploading those elixir api documents.

precompute|1 year ago

> My anonymous source claimed that way back in 2005, Google wanted the full clickstream of billions of Internet users, and with Chrome, they’ve now got it. The API documents suggest Google calculates several types of metrics that can be called using Chrome views related to both individual pages and entire domains.

What answer do the engineers at google working on this have for this violation of privacy?

GuB-42|1 year ago

I am not an engineer at Google, but this is what I would say if I were.

We don't know who you are, you are just a number in a database, and we don't even know what number, we just get the total number of visits for each website, not who visited it. It is like counting cars on a highway, not following your car. Plus, it serves the useful purpose of providing you with better search results, the terms and conditions allow it, and it can be disabled.

raxxorraxor|1 year ago

That would be money. If someone has another excuse, they are naive or lying to themselves.

It certainly is not "to improve the net or advertising" - that would be the lying part.

Google has done some good for the net, but the scales of their contributions are slowly but steadily tipping to the negative side.

danpalmer|1 year ago

Personal (not work related opinion): This basically can’t happen with things like DMA and GDPR. DMA in particular means you can’t share data across “products” without explicit consent. So you could for example collect websites that don’t work for the purposes of improving Chrome, but not then share that with the Ads/Search orgs for personalisation or targeting, as far as I understand the legislation.

Personal opinion about work at Google (still not Google's opinion): I’m consistently impressed with how seriously this stuff is taken and the amount of work that goes into making sure that things like this sharing can’t happen accidentally, and that user choice is respected. The engineers on the ground are absolutely making sure this all works, and most of us care deeply about user privacy. I have personally worked both on implementing new features that significantly push forward privacy, and on implementing privacy controls for regulatory purposes.

marcinzm|1 year ago

> What answer do the engineers at google working on this have for this violation of privacy?

The same answer you probably have for the millions of questions about the things you do that other people find offensive to their personal views and beliefs.

bdlowery|1 year ago

How is it a violation of privacy? Did you read the terms of service?

vouaobrasil|1 year ago

Sometimes I wonder how much better the internet would be if hits on Google weren't directly tied to revenue from Google itself through its ad program. I am certain Google has made the internet and the world a worse place to live.

eitland|1 year ago

As a user of Kagi and search.marginalia.nu I can tell you:

Quite a bit.

So much that now that I have what "everyone" asked Google for for years - that is blacklists - I hardly use them.

Why? Because with Kagi I get much better results out of the box.

I am fairly sure Googlers will tell me there are multiple safeguards to prevent the inclusion of Google ads from affecting ranking, to which I just have to say that the results speak for themselves.

Please note: I have only used Kagi for two years. I am only one user. But I am a user with 20 years of experience with Google, and that has to count for something.

Workaccount2|1 year ago

The fundamental problem with the Internet is that people don't want to pay for things on it.

No matter what, whatever we ended up with was going to be shitty and exploitive.

wslh|1 year ago

Google was really great and revolutionary; it helped zillions of small companies thrive. That was another cycle.

Now it is like the media before the 90s: you need to pay a lot of money to be on the front page of the newspaper.

But hopefully we are talking about LLMs now; they seem like one of the answers to search engines in general. Beyond AI, I see LLMs as a good evolution from PageRank.

A bit general, but lately I use the expression "Complexity as Scam". Google always pointed to their "algorithms" and played with that term as if algorithms couldn't be adjusted to whatever you want them to be. Initially the term was sound because it was based on a scientific paper, but as it evolved, the original PageRank idea seems to have detoured from being a "pure" graph algorithm.
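
For contrast, the original "pure" graph algorithm fits in a few lines. Here's a minimal power-iteration sketch of PageRank (simplified: no dangling-node handling, no convergence check; purely illustrative, nothing from the leak):

import numpy as np

def pagerank(adj, damping=0.85, iters=50):
    # adj[i][j] = 1 if page i links to page j; returns one score per page
    a = np.asarray(adj, dtype=float)
    out = a.sum(axis=1, keepdims=True)
    out[out == 0] = 1.0                      # avoid dividing by zero for pages with no outlinks
    transition = a / out                     # row-stochastic link matrix
    n = a.shape[0]
    r = np.full(n, 1.0 / n)                  # start from a uniform distribution
    for _ in range(iters):
        r = (1 - damping) / n + damping * (transition.T @ r)
    return r

# toy web: page 0 -> 1, page 1 -> 2, page 2 -> 0 and 1
print(pagerank([[0, 1, 0], [0, 0, 1], [1, 1, 0]]))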

Another context where I use "Complexity as Scam" is Web3. It is like Matryoshka dolls, where there is always one more layer of complexity to prove a point, and it never ends.

benterix|1 year ago

It's not black and white. There was a lot of junk that was forced on us and that was removed thanks to Google. But I agree the direct relationship is inherently corrupting.

DarkNova6|1 year ago

You mean the way Google worked originally? The founders were very careful in creating a barrier between ads and search.

A barrier whose erosion has been well documented over the last 10 years.

heresie-dabord|1 year ago

Instead of a semantic Web of knowledge, we got "grep the HTML... with ads".

greg_V|1 year ago

I mean... maybe, but not really. The first problem of the internet was that there simply wasn't that much content. The first internet companies were the broadband providers, like AOL, who were developing content themselves.

Google and the ad ecosystem they acquired was basically the flywheel that spurred content creation at scale. Anyone could jump in, follow a few guidelines and earn a living by producing content on the internet. The Youtube acquisition and monetization followed the same pattern.

Over time the market consolidated and got less and less competitive: fewer platforms with complete control of traffic and one-sided revenue-sharing agreements. The guidelines, so to speak, on how content should look and feel were made algorithmically stricter and stricter until everything looks, feels, sounds and reads the same.

The problem right now is that the platforms are still tightening their grip, and it's all tied to the approach of using AI to replace the content creators on the platforms, from Google to Spotify to Meta, and funneling the money saved to shareholders. And while the web has been shitty for a few years now, we're seeing a sudden drop in quality because the average user has no recourse or alternative, and the average creator doesn't have the means of distribution and monetization (not just publishing, that's been solved) to even find, let alone meet, the new kinds of demand.

I'm certain that in a few years this will even out: new search engines, new aggregators and new feeds will emerge, but the content - money - network problem triangle remains as a fundamental problem of the internet.

linsomniac|1 year ago

Did you experience the Internet before Google? The idea of a world where AltaVista won is truly chilling.

blowski|1 year ago

I imagine it would be a different flavour to what we have today, but the same intensity. Anything that so deeply penetrates daily life across the globe is going to bring enormous problems with it.

1vuio0pswjnm7|1 year ago

There is something truly strange about the idea that people "trust" a website operator and can rely on it to provide them with useful information when that same operator is well known to be secretive, deceptive and dishonest in order to protect its own interests. It's like imagining that a fact witness who tells the truth on some occasions and lies on others is credible.

https://ipullrank.com/google-algo-leak

nsmog767|1 year ago

I work in search and didn't find anything surprising in here. But that's mostly because I've just assumed Google has been lying for years about many things, such as not using click data or Chrome data.

I've directly seen people who have successfully manipulated search rankings by having logged-in chrome users search for a term, and then click on a given page. Works like a charm (though may not stick once the manipulation is done, unless organic users also prefer it).

ec109685|1 year ago

If anyone is surprised about Chrome sending URLs to Google, you can turn the “feature” off by unchecking “Make searches and browsing better” in the sync section of Google Chrome's settings.

Creepy.

HenryBemis|1 year ago

Or, and hear me out, you never use Chrome again, on any platform... like, ever ever again.

Terr_|1 year ago

"But what if I don't want my own computer to build and share a detailed profile of everyone I know, everywhere I go, all my preferences, and how to manipulate me?"

"Well obviously it's your fault for not picking the 'Don't Be Cool' option on subpage 27b-6, duh!"

precompute|1 year ago

Is that part of Chrome not open-source?

noman-land|1 year ago

Imagine thinking you can escape your abuser by living in their house and asking them politely to stop.

thih9|1 year ago

> Thousands of documents, which appear to come from Google’s internal Content API Warehouse, were released March 13 on Github by an automated bot called yoshi-code-bot

Does anyone know more about yoshi-code-bot and how were these documents suddenly published?

Was it a script misconfiguration? A manual push? Something else?

chx|1 year ago

https://github.com/yoshi-code-bot

Created 1,891 commits in 19 repositories.

All 19 are under googleapis.

This looks like a bot Google uses to publish their stuff on GitHub, so this is likely a misconfiguration.

ilrwbwrkhv|1 year ago

And that's why if a developer doesn't use Firefox and uses Chrome, they are just helping a monopoly take over everything and make a mess.

dgellow|1 year ago

Any user, not just developers

metadigm|1 year ago

As soon as they add the ability to configure shortcuts, I'd be more than happy to. After several years of requests, we're finally seeing some movement on their end.

precompute|1 year ago

From the article:

Boosting "organic traffic":

- Brand matters more than anything else

- Experience, expertise, authoritativeness, and trustworthiness (“E-E-A-T”) might not matter as directly as some SEOs think.

- Content and links are secondary when user intention around navigation (and the patterns that intent creates) are present.

- Classic ranking factors: PageRank, anchors (topical PageRank based on the anchor text of the link), and text-matching have been waning in importance for years. But Page Titles are still quite important.

- For most small and medium businesses and newer creators/publishers, SEO is likely to show poor returns until you’ve established credibility, navigational demand, and a strong reputation among a sizable audience.

TL;DR: Clickbait + bot farms are the way to go. No wonder the internet is going to shit.

BillFranklin|1 year ago

FYI, it's much easier to read the linked GitHub code via the published docs at https://hexdocs.pm/google_api_content_warehouse/0.4.0/api-re...

isaacfrond|1 year ago

Most of the factors in ranking a page are no surprise. But I was surprised that having product reviews on your site is apparently a demotion. Surely many people are searching to find just that?

unnamed76ri|1 year ago

Years ago I had a site for deep fryer reviews. The whole thing existed to make money from Amazon’s affiliate program. I hadn’t personally used ANY of the deep fryers. Was just writing reviews based on features and other people’s reviews. In short, I ranked high in Google and added nothing of value to the world with that site.

There was a brief period of time where I made decent money with it until Google deranked all the product review websites.

b112|1 year ago

This is likely more about reviews with affiliate links. 99.99% of those are people reviewing absolutely nothing, just copying reviews and putting their own affiliate link.

zeroCalories|1 year ago

Sites spam low quality product reviews with affiliate links to Amazon. This is done by "reputable" sites as well. I don't blame Google for down ranking this meta.

nottorp|1 year ago

We are, but I’m not sure there are any real product reviews left on the internet.

cqqxo4zV46cp|1 year ago

“xx,xxx five-star reviews” is, I’ve found, a modern-day over-marketed product trope. It feels well within the realm of reason that this ends up serving as a useful heuristic.

yieldcrv|1 year ago

I don’t trust conflicts of interest, so if that’s about a site selling its own product and having reviews, I’m glad to find that results in a demotion.

While bigger marketplaces have other ways of driving ranking

ren_engineer|1 year ago

Most of these have been outright publicly denied by Google employees, despite people showing with A/B tests that things like CTR and backlinks impacted rankings.

skilled|1 year ago

I would usually call this a dupe but this article and the other one from SparkToro are completely different even if they are on the same topic.

Haven’t had a chance to look at the API myself but the first impressions are that a lot of this was suspected by SEOs, but Google kept rejecting the ideas. Looks like clicks increase ranking for sure, which means click farms definitely have a legitimate business solution to offer.

JSDevOps|1 year ago

Seriously considering switching back to Firefox after all these years.

jasonsb|1 year ago

What's stopping you? I use both browsers and I see no reason why someone would pick Chrome over Firefox at this point in time.

GuB-42|1 year ago

I have used both for many years, and now, I see little difference in practice. I am leaning more towards Firefox these days. Main change is that I now use Firefox as my main mobile browser for ad blocking reasons. A few websites don't work on Firefox, I use Chrome for these few.

I don't consider it a problem to use two browsers at the same time; I usually don't do the same thing with them, so having separate profiles can be an advantage.

Note that privacy is not the reason why I am using Firefox. It is just that I think that knowing both is a good thing, and they are both good browsers, so why not? In some cases Firefox is better, in others Chrome is better, and most of the time they are interchangeable.

mind-blight|1 year ago

I've been using Firefox since Chrome forced users to sign in to the browser with their Google account, and I'm quite happy.

The only time it's a problem is when a site detects Firefox and won't display unless you're using Chrome or IE. I've only seen that a couple of times in the years since I switched back.

WhyNotHugo|1 year ago

Firefox is better than Chrome [in the privacy aspect]... but still pretty terrible.

It sends a lot of "analytics" and "tracking" to some of Mozilla's servers, but if you inspect the requests, those servers are actually behind Google's CDN, and Google does the TLS termination.

So... Google has access to all the data that Mozilla sends when it phones home. Some of it even has a unique identifying ID.

Ringz|1 year ago

I've been using Firefox since the days when it had its other name. These days I use Floorp [1], which is based on Firefox but offers much more room for customization. I am very satisfied, except for the stupid name...

[1]: https://floorp.app/en/

garbagewoman|1 year ago

… just considering?!? What is it gonna take

9dev|1 year ago

I found it interesting that the docs mention "site2vec" scores. This implies, I think, a variant of word2vec or document2vec, but for the full site; so probably a vector sum of the doc2vec scores of all individual pages?
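
If that guess is right, a site-level vector could be as simple as a normalized sum or mean of the per-page embeddings. A minimal sketch of that idea, assuming the per-page vectors already exist; the name and shape of everything here are my guesses, not from the docs:

import numpy as np

def site2vec(page_vectors):
    # page_vectors: list of equal-length per-page embedding vectors (doc2vec-style)
    m = np.mean(np.asarray(page_vectors, dtype=float), axis=0)  # centroid of the site's pages
    norm = np.linalg.norm(m)
    return m / norm if norm > 0 else m                          # unit-normalize so sites are comparable

# hypothetical: three pages, 4-dimensional embeddings for brevity
pages = [[0.1, 0.3, 0.0, 0.2], [0.2, 0.1, 0.1, 0.3], [0.0, 0.4, 0.2, 0.1]]
print(site2vec(pages))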

HankB99|1 year ago

> Successful clicks matter.

I wonder about this. If I click a link and read it and I find that it's garbage (e.g. got ranked based on SEO rather than useful content) does it count as a successful click? Worse yet, some of these sites have blatant errors that are only discovered after examination.

This applies to technical subject matter. Other searches, such as shopping, may not suffer this kind of problem (or I have not noticed it).

I also wonder how Google knows a click is successful. If I open a link in another tab, does the browser tell Google how long I lingered on the site? Perhaps Chrome does but I use Firefox.

EcommerceFlow|1 year ago

Once you get to the top 1-3 results, CTR (click through rate) is a much bigger ranking factor. Google knows how long people stay on pages and whether they click and back out immediately. This is important for E-Commerce, because Google doesn't want Site #1 to be mostly out of stock even though they have better links.
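
To make that concrete, here's a rough sketch of a dwell-time-based click signal, the kind of thing "click and back out immediately" suggests. The names and the 30-second threshold are invented for illustration; nothing here is taken from the leaked schema:

from collections import defaultdict

LONG_CLICK_SECONDS = 30  # invented threshold for a "satisfied" click

def click_quality(click_log):
    # click_log: iterable of (result_url, dwell_seconds) pairs from search logs
    good, total = defaultdict(int), defaultdict(int)
    for url, dwell in click_log:
        total[url] += 1
        if dwell >= LONG_CLICK_SECONDS:
            good[url] += 1  # long dwell: the user probably found what they wanted
    return {url: good[url] / total[url] for url in total}

# hypothetical log: a quick bounce off result A, a long stay on result B
print(click_quality([("a.example", 4), ("b.example", 95), ("a.example", 210)]))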

badgersnake|1 year ago

Something like this I guess:

var words = query.split(" ")

var results = executeQuery(
  "SELECT al.url, al.desc FROM AdWords aw
   INNER JOIN adlinks al ON aw.id = al.id WHERE aw.word IN (words)")

if (results.size < 30) { /* TODO: call search engine */ }

return results

usui|1 year ago

> Prior to the email and call, I had neither met nor heard of the person who emailed me about this leak. They asked that their identity remain veiled

And yet the journalist included a screenshot with one of the weakest blurs I've ever seen... Why would you not excise the person's video portion completely? What good does it serve to have it included in the story? Even if that portion is faked, why would you offer potential signals like skin complexion, hair color, background picture, etc.? Why...

mtlynch|1 year ago

The author is Rand Fishkin, who's not a journalist. He's the founder of SparkToro and Moz, both companies that provide tooling and analytics for SEO.

I haven't looked deeply into Fishkin's companies, but I wouldn't expect either to be on the user's side when it comes to privacy. Both companies seem to monetize clickstream data and personal information from users who probably didn't give informed consent.

If the source was trying to get this information to a responsible journalist who cares about privacy, I have no idea why they'd approach a company (not even a news organization) who seems to fund the erosion of user privacy.

krackers|1 year ago

> weakest blurs I've ever seen

Isn't this the same type of "swirl" blur that Interpol was able to reverse even 10 years back? With advancements since then you're basically handing evidence on a silver platter.

txomon|1 year ago

To make it worse, he made clear when the call had happened, so you have: 1) who was in the call, 2) when the call happened, 3) a blur instead of a complete blackout.

I'm not sure I would feel safe reporting stuff to journalists nowadays.

roastedpeacock|1 year ago

That also struck me as odd. And seemingly a violation of journalistic best-practices of protecting sources. I sure hope this was done with consent of the anonymous source.

Control8894|1 year ago

It's a fake background.

It's also clearly from Google Meet so... yeah. If he was worried about retribution (from Google, anyway) then they probably wouldn't have been using a Google service.

adrianvincent|1 year ago

The algorithm is probably so complex and bloated at this point I doubt even Google knows how it really works

stonogo|1 year ago

We call that "AI" in the web world nowadays. It's a feature! You can't game a system you can't understand.

cyanydeez|1 year ago

If($) return true

// TODO: search

zarathustreal|1 year ago

Hopefully this doesn’t surprise anyone... if Google actually told us correct information about how the search algorithm works, it would be abused immediately.

pembrook|1 year ago

What I find most interesting about this is that a lot of the supposedly "smart" algorithms of Big Tech are in fact a patchwork of "dumb" rules and human-picked winners. This would explain why the quality of search results is failing to keep up with developments in LLMs.

This also explains why it's impossible for incumbents to unseat the winners in many search categories -- because they've literally been picked as the winners by humans at Google.

Looking at my Twitter/X feed, I also see an oddly similar dynamic. Certain accounts appear to have been manually boosted, showing up all the time -- whereas others posting even the same exact content will never appear.

Silicon Valley will loudly tell you all about how wonderful they are at "democratizing"; however, if you look under the surface, it appears they're just hand-picking the winners.

trogdor|1 year ago

> because they've literally been picked as the winners by humans at Google

Is there evidence of that in the leaked documents?

alun|1 year ago

Maybe this is an unpopular opinion, but if a search algorithm is truly designed to showcase the best content, then making it transparent shouldn't lead to manipulation

8note|1 year ago

For those out of the know, what's a "crap" in this? A "crap crap"?

throwaway743|1 year ago

... why the hell would an anonymous source use google meet to share info on google? ... so much for remaining anonymous :/

jgalt212|1 year ago

> A sample of statements from Google representatives (Matt Cutts, Gary Illyes, and John Mueller) denying the use of click-based user signals in rankings over the years.

renegade-otter|1 year ago

There are so many Kagi fans on HN that it's a matter of time before the Big G buys it and shuts it down, like hundreds of its products before.

SadCordDrone|1 year ago

Didn't read the article fully, but since it's protocol buffer definitions, what if these fields are there for backward compatibility?

Havoc|1 year ago

Does it also recommend eating at least two stones a day?

StevenNunez|1 year ago

Wait... There's Elixir to be done at Google?!

dentemple|1 year ago

TL;DR Google lies about how its search algorithm works.

eitland|1 year ago

Would be interesting to see if any relevant authorities could be interested now that this is out?

I understand some of this is a direct contradiction of things they have said in court previously?

ChrisArchitect|1 year ago

[deleted]

iamacyborg|1 year ago

It’s not a dupe, the ipullrank.com article goes into much more depth than the one from sparktoro.

Aldipower|1 year ago

If there are really 14,000 attributes, most of them will have a weight near 0 and thus be irrelevant. If they were all heavily weighted, the ranking would be rendered meaningless by the sheer number of attributes.
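
A toy illustration of that point (all numbers invented, nothing from the leaked schema): with thousands of features in a linear combination, the score is dominated by the handful whose weights are not near zero.

import numpy as np

rng = np.random.default_rng(0)
n_features = 14_000
weights = rng.normal(0, 0.001, n_features)   # almost every attribute weighted near zero
weights[:20] = rng.normal(0, 1.0, 20)        # a handful of heavy signals
features = rng.random(n_features)            # one document's attribute values

total_score = weights @ features
heavy_part = weights[:20] @ features[:20]
print(f"total score {total_score:.2f}, contribution of the 20 heavy attributes {heavy_part:.2f}")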

beejiu|1 year ago

Isn't that where deep learning comes into play?

ozehlaw|1 year ago

Yes, this makes sense. I think the only good thing in the leak for Google is that the scoring values are not present.