To Break Google’s Monopoly on Search, Make Its Index Public

[+] nostrademons|6 years ago|reply

Ex-Google-Search engineer here, having also done some projects since leaving that involve data-mining publicly-available web documents.

This proposal won't do very much. Indexing is the (relatively) easy part of building a search engine. CommonCrawl already indexes the top 3B+ pages on the web and makes it freely available on AWS. It costs about $50 to grep over it, $800 or so to run a moderately complex Hadoop job.

(For comparison, when I was at Google nearly all research & new features were done on the top 4B pages, and the remaining 150B+ pages were only consulted if no results in the top 4B turned up. Difficulty of running a MapReduce over that corpus was actually a little harder than running a Hadoop job over CommonCrawl, because there's less documentation available.)

The comments here that PageRank is Google's secret sauce also aren't really true - Google hasn't used PageRank since 2006. The ones about the search & clickthrough data being important are closer, but I suspect that if you made those public you still wouldn't have an effective Google competitor.

The real reason Google's still on top is that consumer habits are hard to change, and once people have 20 years of practice solving a problem one way, most of them are not going to switch unless the alternative isn't just better, it's way, way better. Same reason I still buy Quilted Northern toilet paper despite knowing that it supports the Koch brothers and their abhorrent political views, or drink Coca-Cola despite knowing how unhealthy it is.

If you really want to open the search-engine space to competition, you'd have to break Google up and then forbid any of the baby-Googles from using the Google brand or google.com domain name. (Needless to say, you'd also need to get rid of Chrome & Toolbar integration.) Same with all the other monopolies that plague the American business landscape. Once you get to a certain age, the majority of the business value is in the brand, and so the only way to keep the monopoly from dominating its industry again is to take away the brand and distribute the productive capacity to successor companies on relatively even footing.

[+] mrtksn|6 years ago|reply

I think it is possible to make a way, way better search engine because Google Search is no longer as good as it used to be, at least for me.

I can no longer find anything of remotely good quality; I discover new, quality stuff from social media like Twitter and HN.

The search results seem to be too general and too mainstream. Nothing new to discover, just a shortcut to a few websites like Reddit and StackOverflow for more techie things, and Wikipedia and a few mainstream news websites for the rest.

I usually end up searching HN, Reddit or StackOverflow directly, as the result quality is better and I can easily get specific. Getting specific is harder on Google because it quite often just omits or misinterprets my search query keywords.

[+] appleshore|6 years ago|reply

If there were viable alternatives, people would shift over time.

If I type in “<name> Pentagon” on Google, the first link is LinkedIn. DuckDuckGo doesn’t list it at all. There are countless examples where DuckDuckGo just can’t find basic information. DDG is just unreliable, beyond its silly name.

[+] nojvek|6 years ago|reply

This ^ times a 1000.

Google simply has the best search product. They invest in it like crazy.

I’ve tried Bing multiple times. It’s slow, and it spams MSN ads in your face on the homepage. Microsoft just doesn’t get the value of a clean UX.

DuckDuckGo results were pretty irrelevant the last time I tried them. Nothing comes close to Google's usability. To make people switch, an alternative has to be much, much better than Google. Chances are that if something is, Google will buy it.

[+] Eridrus|6 years ago|reply

Sure, it costs $50 to grep it, but how much does it cost to host an in-memory index with all the data?

This is not a proposal to just share the crawl data, but the actual searchable index, presumably at arm's-length cost both internally & externally.

The same ideas could be extended to the Knowledge Graph, etc.

IMO the goal here should not be to kill Google, but to keep Google on their toes by removing barriers to competition.

[+] websitejanitor|6 years ago|reply

>The comments here that PageRank is Google's secret sauce also aren't really true - Google hasn't used PageRank since 2006.

That's quite a claim considering they were reporting PageRank in their toolbar until 2016, and toolbar PageRank was visible in Google Directory until 2011.

Are you talking about PageRank from the original patent?

[+] Xelbair|6 years ago|reply

>The real reason Google's still on top is that consumer habits are hard to change, and once people have 20 years of practice solving a problem one way, most of them are not going to switch unless the alternative isn't just better, it's way, way better.

I agree about consumer habits, but not about quality. I mean, the Google of today is a worse search engine than the Google of 5 years ago.

Now Google tries to guess, badly, what you meant, instead of giving you what you asked for. The pleasure of dealing with IT systems is that they give you what you ask them for, not what you meant. Guessing introduces extra error, and worse, error that cannot be fixed by the user.

I can rephrase my query, and Google will still interpret it, leading to the same batch of useless results.

[+] burtonator|6 years ago|reply

I can also comment here. I built and still run a petabyte-scale web crawler:

https://www.datastreamer.io/

Common Crawl and other sources do in fact have a ton of data that can be used which is very affordable.

The DATA itself stopped being a real competitive advantage probably 2008-2010.

Google's major advantage now is its algorithms and the fact that they've proven it works and is reliable.

Most importantly, it's the brand. Google MEANS search in the US and that won't change anytime soon.

PS,... if you need tons of web and social data Datastreamer can hook you up too :)

[+] bogomipz|6 years ago|reply

>"Indexing is the (relatively) easy part of building a search engine. CommonCrawl already indexes the top 3B+ pages on the web and makes it freely available on AWS."

Interesting, I would have thought that crawling at this scale and finishing in a reasonable amount of time would still be somewhat challenging. Might you have any suggested reading for how this is done in practice?

>"It costs about $50 to grep over it, $800 or so to run a moderately complex Hadoop job." Curious what type of Hadoop job you might be referring to here. Would this be building smaller, more specific indexes or simply sharding a master index?

>"Google hasn't used PageRank since 2006." Wow that's a long time now. What did they replace it with? Might you have any links regarding this?

[+] unknown|6 years ago|reply

[deleted]

[+] devit|6 years ago|reply

IMHO a simpler and probably the only viable way to force competition is to legally forbid Google from responding to any query during certain recurring time periods.

For instance, if you were to forbid Google from operating on every odd-numbered day, then 50% of the search engine market and revenues would immediately be distributed among competitors. Furthermore, users would be forced to test multiple engines, and they could find a better one to use even when Google is allowed to operate.

Obviously this has a short-term economic cost if other search engines aren't good enough, as well as imposing an arbitrary restriction on business, so it's debatable whether this would be a reasonable course of action.

[+] jbverschoor|6 years ago|reply

Actually the omnibox made it really easy to switch to ddg. With an occasional fallback to google.

I have no problem with advertising etc. but the tracking and selling of data is such an idiotic thing. We as consumers should have a global internet-law, and be reimbursed for data leaks or usage outside the scope of the application.

By no problem with ads I mean the original ads of google. Very clear they were ads and not intermingled with the results. Scrolling down for the results is nuts. I will click ads if they’re relevant, regardless of if they’re on the right or in the results. So please stop supporting this fraud against advertisers.

[+] ChrisCinelli|6 years ago|reply

> when I was at Google nearly all research & new features were done on the top 4B pages, and the remaining 150B+ pages were only consulted if no results in the top 4B turned up

This may help to explain the poor quality of some of the results on queries I have run on Google lately, which return content obviously written to rank well through SEO but with very little value.

I have 2 questions:

- What makes "the top 3B+" the top ones?

- How can I "force" a search on the other 150B+ pages?

[+] mda|6 years ago|reply

I find it odd that you claim to be a former Google search engineer and in the end boil down the success of Google search to brand recognition / loyalty. You kinda glossed over the insane complexity of building and maintaining a high quality search engine, really weird comment to be honest.

[+] tripzilch|6 years ago|reply

> you'd have to break Google up and then forbid any of the baby-Googles from using the Google brand or google.com domain name.

Just let "google" become the generic term for search, as it's already well on its way.

[+] Fion|6 years ago|reply

Page Rank is a synonym for link juice. So when you say Google hasn't used page rank since 2006, can you confirm that you are talking about link juice as opposed to the old toolbar representation of page rank? And assuming you do mean link juice, well, why do links still work so well for SEO?

[+] unknown|6 years ago|reply

[deleted]

[+] noahl|6 years ago|reply

Okay, this is a relatively serious proposal to require Google to allow API access to its search index, with the premise that it would democratize the search engine ecosystem. There are some issues with the regulations he proposes (you have to allow throttling to prevent DDoS attacks, and you can't let anyone with API access add content to prevent garbage results), but it's roughly feasible.

The main problem is, I think the author is wrong about what Google's "crown jewel" is. Yes, Google has a huge index, but most queries aren't in the long tail. Indexing the top billion pages or so won't take as long as people think.

The things that Google has that are truly unique are 1) a record of searches and user clicks for the past 20 years and 2) 20 years of experience fighting SEO spam. 1 is especially hard to beat, because that's presumably the data Google uses to optimize the parameters of its search algorithm. 2 seems doable, but would take a giant up-front investment for a new search engine to achieve. Bing had the money and persistence to make that investment, but how many others will?

[+] com2kid|6 years ago|reply

> 1) a record of searches and user clicks for the past 20 years

From what I can tell, Google cares a lot more about recency.

When I switch over to a new framework or language, search results are pretty bad for the first week, horrible actually as Google thinks I am still using /other language/. I have to keep appending the language / framework name to my queries.

After a week or so? The results are pure magic. I can search for something sort of describing what I want and Google returns the correct answer. If I search for 'array length' Google is going to tell me how to find the length of an array in whatever language I am currently immersed in!

As much as I try to use Duck Duck Go, Google is just too magic.

But I don't think it is because they have my complete search history.

Also people forget that the creepy stuff Google does is super useful.

For example, whatever framework I am using, Google will start pushing news updates to my Google Now (or whatever it is called on my phone) about new releases to that framework. I get a constant stream of learning resources, valuable blog posts, and best practices delivered to me every morning!

It really is impressive.

[+] dkyc|6 years ago|reply

> Yes, Google has a huge index, but most queries aren't in the long tail.

I'm not quite sure about that. 15% of Google searches per day are unique, as in, Google has never seen them before. [1]. That's quite an insane number.

[1] https://searchengineland.com/google-reaffirms-15-searches-ne...

[+] tryptophan|6 years ago|reply

>2) 20 years of experience fighting SEO spam.

Tangential - but does anyone else feel that Google results are useless a lot of the time? If you search for something, you will get 100% SEO-optimized, shitty, ad-ridden blog/commercial pages giving surface-level info about what you searched for. I find for programming/IT topics it's pretty good, but for other topics it is horrible. Unless you are very specific with your searches, "good" resources don't really percolate to the top. There isn't nearly enough filtering of "trash".

[+] dalbasal|6 years ago|reply

I would assess Google (& FB's) "crown jewel" as, ultimately, their market share, which is related to your points... and causation runs both ways.

The user data helps/ed Google create the superior UX, as you say. The reach is what makes Google & FB valuable to advertisers. A search engine with 0.1% of Google's user volume cannot collect 0.1% of Google's revenue from advertisers. Returns to scale/reach/market-share are very substantial in online advertising.

I'm glad we're talking though. Those tech giants are too powerful.

Ultimately, the old antitrust toolkit is near useless today for dealing with tech monopolies. It's not obvious what "break up Google" even means. There are strong network effects and other returns-to-scale. It's a zero-marginal-cost business, which was rare enough in the past that economists ignored it.

We need fresh thinking, a new vocabulary, new tools, but we do need to deal with it.

[+] evrydayhustling|6 years ago|reply

> it's roughly feasible

What do folks even mean by "Google's index"?? Google results combine tons of signals, including personal histories for each user. Sharing metadata for the top billion URLs wouldn't cover half the functionality, or make a competitive engine. And on the other hand, there may not be a single other organization in the world prepared to manage a replica of the entire data plane that impacts search. The proposal is somewhere between underspecified and nonsense.

[+] inlined|6 years ago|reply

> Bing had the money and persistence to make that investment, but how many others will?

I hypothesized once to an ex-Microsoft HIGH-up that it probably took 10B to launch Bing. He said I was almost exactly on the nose.

Also this is a ridiculous thing to ask for. How much money do you think Google pays for the bandwidth to crawl the web? How much do you think it costs to run the machines that create indexes out of that? How do you value the IP involved in the process?

Google should give away the fruits of that labor for free, plus invest in a reasonable API to download that index? Plus the bandwidth of sharing that index with third parties? It’s probably not even feasible aside from putting disks or tapes on multiple semis to send to clients. The index is 100 petabytes according to [0]. With dual fiber lines, and no latency for mind bending numbers of API calls, that would take 12.6 YEARS to download a single snapshot.

[0] https://www.google.com/search/howsearchworks/crawling-indexi...
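
As a rough check on the 12.6-year figure above: the comment does not state a line speed, so the sketch below assumes two 1 Gbit/s links and a decimal petabyte (10^15 bytes). Both values are assumptions for illustration, not figures taken from the comment or from the linked page.

```python
# Back-of-the-envelope transfer time for a 100 PB snapshot.
# ASSUMPTIONS (not from the comment): decimal petabytes and
# two links of 1 Gbit/s each, fully saturated with no overhead.

BYTES_PER_PETABYTE = 10**15
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60


def transfer_years(size_petabytes, link_count, gbit_per_link):
    """Return the years needed to move size_petabytes over the given links."""
    total_bits = size_petabytes * BYTES_PER_PETABYTE * 8
    bits_per_second = link_count * gbit_per_link * 10**9
    return total_bits / bits_per_second / SECONDS_PER_YEAR


years = transfer_years(size_petabytes=100, link_count=2, gbit_per_link=1)
print(f"{years:.1f} years")
```

Under those assumptions the result is roughly 12.7 years, in line with the quoted figure. Faster links shrink the number proportionally, so the conclusion depends heavily on the assumed bandwidth.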

[+] detritus|6 years ago|reply

> Indexing the top billion pages or so won't take as long as people think.

This is what makes me wonder why we don't have a LOT of competing search engines. Perhaps I'm vastly underestimating the technology and difficulty (I could well be - it's not my domain), but surely it can't be THAT hard to spawn Google-like weighted crawl-based search results?

It's a long-since solved problem - heck, PageRank's first iteration recently came out of patent protection - it could just be copy-pasted. Why aren't all the big companies Doing Search?
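
For readers unfamiliar with what "PageRank's first iteration" refers to, here is a minimal sketch of the textbook power-iteration formulation on a toy link graph. It is not Google's production ranking; the graph, the page names, and the damping factor of 0.85 are illustrative assumptions.

```python
# Textbook PageRank by power iteration on a tiny, made-up link graph.
# This illustrates the published idea only, not any production system.

def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page gets a baseline share from the random-jump term.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Split this page's rank evenly across its outgoing links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank over all pages.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical graph: "c" is linked to by every other page.
toy_links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(toy_links)
for page, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(page, round(score, 3))
```

The core fits in a few dozen lines, which supports the point that the basic algorithm is easy to copy. It says nothing about spam resistance, freshness, or serving latency, which other comments in this thread identify as the hard parts.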

[+] londons_explore|6 years ago|reply

> 1) a record of searches and user clicks for the past 20 years

If a government was serious about getting more players in the search industry, they would force Google (and all other players) to make this data public.

Simply say "All user-behaviour data used to improve the service must be freely published".

Make the law apply to any web service with more than 20 million users globally so small businesses aren't burdened.

If the data cannot be published for privacy reasons, the private parts must be separated and not used by Google or its competitors.

[+] nova22033|6 years ago|reply

Caveat: The author is not a technologist

Robert Epstein (born June 19, 1953) is an American psychologist, professor, author, and journalist. He earned his Ph.D. in psychology at Harvard University in 1981, was editor in chief of Psychology Today,

He has also made some questionable claims about google manipulating search results to favor Hillary Clinton.

https://en.wikipedia.org/wiki/Robert_Epstein#cite_note-15

His research is based entirely on his own experience

“It is somewhat difficult to get the Google search bar to suggest negative searches related to Mrs. Clinton or to make any Clinton-related suggestions when one types a negative search term,” writes Dr. Robert Epstein, Senior Research Psychologist at the American Institute for Behavioral Research and Technology.

[+] tomweingarten|6 years ago|reply

The comments he made are not just questionable, they're outright wrong (and a great example of the problem with cherry-picking data):

https://www.vox.com/2016/6/10/11903028/hillary-clinton-googl...

https://www.politifact.com/punditfact/statements/2016/jun/23...

(Disclosure: I work at Google, but this opinion is my own)

[+] learnfromstory|6 years ago|reply

Just FYI the completion results in the omnibox have little to do with the search engine results. Clearly the search engine produces millions of hits for “Hillary Clinton emails”. The completions are a completely separate system based on what people type in the box, not what’s in the index, and it’s laser-focused on producing interactive results.

[+] aerovistae|6 years ago|reply

[deleted]

[+] klntsky|6 years ago|reply

> It is somewhat difficult to get the Google search bar to suggest negative searches related to Mrs. Clinton

Would be nice to know whether it was because of the search history bubble he lived in or not.

[+] swebs|6 years ago|reply

He does raise some good points in his findings:

https://sputniknews.com/us/201609121045214398-google-clinton...

I know Sputnik isn't a good source, but according to him, they were the only ones who would publish the findings without edits.

[+] briandear|6 years ago|reply

> He has also made some questionable claims about google manipulating search results to favor Hillary Clinton.

Despite it being off topic, can we define why those claims are questionable? Is there data proving those claims wrong? Because with all the Google political controversies over the past few years, and given the political donation history of Google employees, it’s highly plausible that search results are manipulated to favor certain politics over others.

If the “questionable claims” have been disproven or are inaccurate, then it would seem that you’d provide some proof. Essentially, if you are to claim the search engine was not biased towards Clinton, certainly there would be some proof of that? It’s more reasonable to suspect Google manipulating search engines than not, given the political environment at Google.

The real “questionable claim” is that Google is neutral in any way — which is kind of the entire premise of the article. If Google were completely neutral, then why would their monopoly on search need to be broken?

[+] zubspace|6 years ago|reply

From the article:

"But what about those nasty filter bubbles that trap people in narrow worlds of information? Making Google’s index public doesn’t solve that problem, but it shrinks it to nonthreatening proportions. At the moment, it’s entirely up to Google to determine which bubble you’re in, which search suggestions you receive, and which search results appear at the top of the list; that’s the stuff of worldwide mind control. But with thousands of search platforms vying for your attention, the power is back in your hands. You pick your platform or platforms and shift to others when they draw your attention, as they will all be trying to do continuously."

But this is a huge problem. I'd rather have 10 independent search providers instead of 10 companies proxying the results of Google. It's worse if I don't even know which index the results come from. I guess many people don't know that Startpage shows you Google results.

I don't want Google results! I want different web crawlers ordering the results according to my taste without tracking each and every page impression of me. Give me that and I'll switch in a heartbeat.

[+] jacknews|6 years ago|reply

Maybe this article should be made public too.

[+] eaenki|6 years ago|reply

The effect of this on Alphabet's revenue would be nil.

The majority of Google's Revenue comes from Google, Youtube, Gmail and Play. They make so much $ because they have the biggest network effect of advertisers-eyeballs in the world along with Facebook. That. Is. Unbreakable. Even more than a social network's network effect, because the friction to switch budgets and people in a company is higher than a guy telling their best friends to download an app.

And then, YT is a network effect. And then, Play/Android is also a network effect. And then there's the branding. But presumably every big company has the latter. Still, what a brand. Everyone knows what Google or Android is. Every. Single. Human.

Finally, because they make all this money, they can pay to be the default search engine on the other half of the devices, Apple's devices. Last time I checked, $5B a year.

Hence, this article is so bad.

I don't even care about Google, just saying.

edit: did I mention Chrome? They've got chrome too, with the googleverse as default.

[+] taf2|6 years ago|reply

So although everyone likes to believe Google is a monopoly, it’s far from it. You have choices: Bing, Baidu, Yandex, DuckDuckGo... There is also nothing about Google's search position that prevents you from building a competitor. What we do have is Peter Thiel backing an administration that’s anti-Google, and Russia and China that are anti-Google. Why? It’s a source of truth that challenges their lies. We also have an emergent, cult-like backlash against personalized ads. Combine all of these factors and you get a lot of pressure and misinformation telling you Google is evil. Additionally, karma: Google led the charge against Microsoft with its "do no evil" position, when Microsoft did have an OEM monopoly preventing others from competing. Anyway, that is how I see it. So is Google near to being a monopoly? No. I think they would need to be doing a lot worse things, and there is room to compete, and people should.

[+] ga-vu|6 years ago|reply

This is dumb. You mean if I work two decades to develop tech that nobody else can copy, I have to open-source it because my competitors are dumbasses?

This is not how it's supposed to work.

[+] iainmerrick|6 years ago|reply

I don’t see any mention in this article of what seems like the most obvious way to split up Google, separating their search and ad businesses. (Edit to add: although maybe the effect would end up being similar, if API users serve their own ads but without access to Google’s ad infrastructure.)

That obviously wouldn’t be a simple job, of course, and maybe there are some interesting reasons why it wouldn’t work well.

[+] astonex|6 years ago|reply

A list of websites and their content is really not useful at all. Anyone can get this themselves with some really simple programming.

The actual hard part is when it comes to ranking and sorting the data in any useful way, and doing it within like 100ms. Plus various other issues like spam protection etc. This is where Google excels (at least in my opinion).

[+] seisvelas|6 years ago|reply

This seems a bit silly - it's not like Google's search results are that much better than Bing or DuckDuckGo (or better at all most of the time). Google has Google Colab, Chrome, and Android, and the whole Docs ecosystem, and they've integrated those things together pretty well while still being loosely coupled enough to switch out any part of the ecosystem with a competitor's product.

Is there a reason to break up Google other than that they are doing well? Other search engines seem to have no problem establishing their own niche and doing well.

[+] gigatexal|6 years ago|reply

I'm of two minds here: Google's whole reason for ascending to where they are is the PageRank algorithm, which Google was created to monetize. I see this in a similar vein to Apple and iOS: would we support calls for Apple to be forced to allow iOS to be installed on non-Apple hardware? If not, then why would we insist on Google giving up its reason for being, the reason a lot of us use it to find relevant information?

Then again, the concentration of power in a handful of operators likely threatens the open internet.

[+] peteretep|6 years ago|reply

I don’t want to break Google’s monopoly on search. Google’s search is fantastic. It’s their advertising business knowing too much about me I care about.

[+] sterlind|6 years ago|reply

I lean pretty far left, but this proposal makes me really uneasy. It seems very coercive and ham-fisted. I think instead, Bing should consider opening its index up. It'd be a massive PR coup, they wouldn't lose anything of much value, and anyone who put the index to better use than them would help them by breaking Google's stranglehold over search.

[+] dalbasal|6 years ago|reply

So... this article is a good example of how ??!? it gets once you move from "We gotta do something about these tech monopolies" into the "what should we do?" phase.

How exactly *do* you break up a Google or a FB so that (only one possible reason, but the one cited here) they don't control too much media/mind share?^

Laws usually want to be general, and my suggestion doesn't necessarily lend to that, but I'll suggest it anyway.

Facebook doesn't need to be broken up into several companies. It can just be shut down.

I don't mean that it should (justice-wise) be shut down. I just mean that we won't lack for social media. We will have social media alternatives the day after FB shuts down. There's a chance we'll get something more open instead. There's a good chance we'll get several small replacements instead. There is 0-chance that we'll lack for ways to share posts and post pictures. This isn't Bell, where we need to keep the phones working. The phones will work fine with or without Facebook.

There is no need for a company to generate $70bn in revenue, in order for us to have social media. That's a key difference from all antitrust cases of the past.

YouTube is another sort of example. If it shuts down, alternatives will pop up immediately... maybe open ones.

There are justice questions (is it fair to shareholders/employees/zuck?) There are legal validity questions (why FB and not apple?). But, for the practical questions... the problem is an easy one.

^fwiw, I also think this is the most worrying part. These companies have tremendous control over how and what people think. They make Murdoch media look quaint.

[+] ng12|6 years ago|reply

I don't understand why we are talking about this. Google is far from the first "monopoly" that I would like to see broken up.

My guess is that Google's not lobbying effectively.

[+] ngngngng|6 years ago|reply

Fun timing to read this, this last weekend I was playing around with making my own search engine to understand better how ElasticSearch and Lucene work.

It occurred to me that the two most powerful things Google has to work with are records of clicks, and the time users spent on the webpages Google returned. I've argued against Google monopoly before because I can throw together a web crawler and search engine in a weekend, so it's not like it's a hard market to enter.

> According to W3Techs, Google Analytics is being used by 52.9 percent of all websites on the internet

This is the real problem though. When a search engine sees a new query, it uses everything it's got to assert which pages the user wants, but with Google Analytics, they can test their assertions constantly to see if a user actually wanted that web page. Then your future queries could be compared against previous queries that were validated by a user spending several minutes of active time on the returned page.

I'm sure Google's algorithm is great and all, but I really think this is what sets them apart.
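
The feedback loop described above can be sketched in a few lines. This is a toy illustration of the idea only, not a description of how Google uses Analytics data; the class name, the 120-second threshold, and the example URLs are all hypothetical.

```python
# Toy sketch: treat a long visit as validation that a result matched a query,
# then prefer previously validated results when the same query comes back.
# Everything here (names, threshold, URLs) is a hypothetical illustration.
from collections import defaultdict

MIN_DWELL_SECONDS = 120  # assumed cutoff for "the user actually wanted this page"


class DwellFeedback:
    def __init__(self):
        # query -> url -> number of validated visits
        self._validated = defaultdict(lambda: defaultdict(int))

    def record_visit(self, query, url, dwell_seconds):
        """Count the visit only if the user stayed long enough."""
        if dwell_seconds >= MIN_DWELL_SECONDS:
            self._validated[query.lower()][url] += 1

    def rerank(self, query, candidates):
        """Order candidates by validated visits; ties keep their original order."""
        counts = self._validated.get(query.lower(), {})
        return sorted(candidates, key=lambda url: counts.get(url, 0), reverse=True)


feedback = DwellFeedback()
feedback.record_visit("array length", "b.example/arrays", dwell_seconds=300)
feedback.record_visit("array length", "a.example/spam", dwell_seconds=5)

results = feedback.rerank(
    "Array Length", ["a.example/spam", "b.example/arrays", "c.example/other"]
)
print(results)
```

The short visit is ignored and the long one moves its page to the front. A crawler-only competitor has no equivalent signal, which is the asymmetry the comment points at.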

[+] jedberg|6 years ago|reply

Good idea! We should make Bloomberg’s stock data public too!

[+] sct202|6 years ago|reply

I wonder if it's even in America's interests to create a weaker Google/Facebook/Amazon/Microsoft. These companies are dominating globally (excluding China) and bring back so much money, jobs, and influence to America. Weakening them might allow real foreign competition to flourish.

[+] ksahin|6 years ago|reply

"DuckDuckGo, which aggregates information obtained from 400 other non-Google sources, including its own modest crawler.)"

I looked into it, and it seems DDG is using the Bing and Yahoo search APIs and lots of other sources. I looked into the pricing of Yahoo's and Bing's search APIs; it ranges from $0.80 / 1000 queries to several dollars per thousand queries.

It seems too expensive to be economically viable with ads. What am I missing?
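
To put the quoted API prices in per-search terms, here is a small sketch. The $0.80 per 1000 queries figure comes from the comment; the monthly query volume is a made-up number for illustration.

```python
# Per-search cost implied by a per-thousand-queries API price.
# The 0.80 price is from the comment; the query volume is hypothetical.

def cost_per_query(price_per_thousand):
    """Dollars paid to the upstream API for a single query."""
    return price_per_thousand / 1000


def monthly_api_cost(queries_per_month, price_per_thousand):
    """Total upstream API bill for a month of traffic."""
    return queries_per_month / 1000 * price_per_thousand


per_query = cost_per_query(0.80)
monthly = monthly_api_cost(queries_per_month=1_000_000, price_per_thousand=0.80)
print(f"${per_query:.4f} per query, ${monthly:,.2f} per month")
```

At the low end of the quoted range, ads would need to bring in more than $0.80 per thousand searches just to cover the upstream API, before any other costs.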

597 comments