Let’s take that thought to its conclusion. If clicks on Google really account for only 1/1000th (or some other trivial fraction) of Microsoft’s relevancy, why not just stop using those clicks and reduce the negative coverage and perception of this? And if Microsoft is unwilling to stop incorporating Google’s clicks in Bing’s rankings, doesn’t that argue that Google’s clicks account for much more than 1/1000th of Bing’s rankings?
My take on this is quite the opposite. If MS thinks they're doing nothing wrong, they shouldn't stop. They should do what is best for their customers.
Reversing what they do now because Google "caught" them would, IMO, imply they were doing something wrong and got caught.
I personally think what Bing is doing is great. I regularly use Bing and Google. I wish there were a streamlined way for me to broadcast to both of them... "for search term X this is the relevant link!"
MS has found a way to do this, and as long as it's opt-in, I like it. I wish Google had the same (although I do realize that since Google owns so much of the traffic it is less of an issue for them -- it's a net loss in terms of the flow of information).
But maybe a question for Matt... Can you and MS work together for a toolbar that does just this for both engines?
As a user this isn't a matter of Bing copying Google or not. It's about the fact that search relevance still kind of sucks. I feel like I can make the search experience better, especially on those occasions when I go to your competitor's site.
I wonder if you would be ok if Bing did the same thing to amazon. That is, imagine they used toolbar/IE logs to infer that people went to amazon, searched for LCD TVs and then purchased model X. Then they could boost pages about X in search, or implement a "bestselling" feature. After all, they are "just" using clickstream data here. Similarly, they could track Netflix, and so on.
IMHO, there's a loophole in Bing's argument that the user click data is free for them to use. What about the site - do they have no ability to consent/dissent to being tracked?
Maybe robots.txt needs to be updated for the toolbar.
What is being data mined is a bit more than a user broadcasting their own preference on the correct result. The user is broadcasting a URL which is selected based on two factors:
- the user's preference
- Google's ranking algorithm.
Had Google not ranked that URL, the user wouldn't be broadcasting it. If there was some way to extract the factor of the user's preference of URLs as a signal without the factor of Google's ranking algorithm, you would be more spot on. Unfortunately this isn't possible. Also, in most cases the ranking algorithm component of this broadcasted data is the more useful factor of the two.
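A toy position-bias model makes the entanglement concrete: the chance of observing a click is roughly the product of rank-dependent examination and relevance, so every click inherits Google's ordering. This is purely an illustrative sketch; neither engine's actual click model is public.

```python
# Toy examination model: P(click) = P(examined at rank) * P(relevant).
# Purely illustrative -- not either engine's actual click model.

def click_probability(rank: int, relevance: float) -> float:
    """Chance of a click, conflating rank bias with the user's preference."""
    examination = 1.0 / rank  # users mostly examine the top results
    return examination * relevance

# A mediocre page Google ranks #1 collects more clicks than a better
# page ranked #5, so raw click counts encode Google's ranking decision.
top_mediocre = click_probability(rank=1, relevance=0.4)  # 0.4
deep_better = click_probability(rank=5, relevance=0.9)   # 0.18
assert top_mediocre > deep_better
```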
I agree with you. Particularly, "why not just stop using those clicks and reduce the negative coverage and perception of this" seems like a very biased suggestion.
All of the negative perception seems to come from Google deliberately inserting outlier data. For MS to back down on that basis immediately makes them look like they were doing much more than what is proven.
Best for their customers in what time frame? They might get better results in the short term, but it's much less clear what the long term effects are.
What if it slows down innovation in search quality and moves effort into marketing and shiny baubles? After all, why bother spending resources on researching, implementing and evaluating the algorithms when your competitors will, within a couple of months, gain most of the improvements automatically, with no effort?
You complain about search relevance sucking. Why are you so eager to support a practice that has a high chance of causing it to stagnate?
> My take on this is quite the opposite. If MS thinks they're doing nothing wrong, they shouldn't stop. They should do what is best for their customers.
Does this mean anyone should feel free to ignore robots.txt if they think it will be better for their customers?
Agreed. And his example of how his mom could not understand what the IE disclosure really meant in terms of information sharing is silly. Your typical mom also does not understand the true extent of data collection from using Google's own products. Does that make it wrong?
If this joint toolbar comes to fruition, which it never will, the crowd will decide the obvious winner for practically all queries. At that point what will be the difference between Bing and Google?
"Copying" was a pretty brutal word to use-- not surprising that it raised MSFT's hackles a bit.
MS clearly uses toolbar users' clickstreams (on and off Google) to improve their own search efforts. Google created an artificial scenario where the ONLY input was Google search behavior and lo, the search results are exactly the same. Whether or not that steps over a line (I don't feel that it does), it's not "copying" in my book.
Another interesting point is that Google has been beating the "open" drum for a long time now. No walled gardens, right? If a Facebook user should have the right to take his data with him wherever he wants to go, shouldn't a Google user be able to fork over their behavior data to Bing?
Matt's point about MS' lack of clarity when getting folks permission to grab their clickstream was dead-on. THAT is pretty outrageous and MS should be ashamed of that.
Regardless of all that, hats off to Matt for keeping a cool head and stating his position in a respectful way.
I don't understand how this isn't a settled issue. If clickstream data is 1 of 1000 signals, and you create clickstream data for a specialized query that will never trigger off another signal, then your created data will be reflected. That sounds exactly like what happened.
You'd have to make the argument that using this data is wrong, somehow. But to make that argument, you'd basically have to argue that users shouldn't be able to share their habits with whomever they want to. I doubt that argument can be made in a compelling way.
I'm surprised Google is pushing this further, and a little disappointed.
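The honeypot logic described above can be sketched as a weighted sum of signals: when a nonsense query fires no other signal, even a tiny clickstream weight decides the result. The signal names and weights below are invented for illustration, not Bing's actual model.

```python
# Toy linear combiner for ranking signals. Signal names and weights are
# invented for illustration; they are not Bing's actual model.

def score(signals: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted sum of whatever signals fired for a (query, page) pair."""
    return sum(weights[name] * value for name, value in signals.items())

weights = {"links": 0.5, "text_match": 0.4, "clickstream": 0.001}

# For a nonsense honeypot query, no links point at the page and the text
# doesn't match, so only the synthetic clicks fire -- and they decide.
honeypot = {"links": 0.0, "text_match": 0.0, "clickstream": 1.0}
normal = {"links": 0.8, "text_match": 0.9, "clickstream": 0.2}

assert score(honeypot, weights) > 0.0  # clickstream alone places the result
```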
"Google created an artificial scenario where the ONLY input was Google search behavior and lo, the search results are exactly the same."
Cutts addresses this:
"As we said in our blog post, the whole reason we ran this test was because we thought this practice was happening for lots and lots of different queries, not simply rare queries."
Imagine I launch a search engine with no data. Then I feed it with the URLs IE users click on after their Google searches.
I will eventually end up with an exact copy of Google's database. So I think that this technique can be called "copying".
Now if Bing uses this technique for 0.1% of their data, then it can be said that 0.1% of their data are copied from Google's database.
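A minimal sketch of that feeding step, assuming clickstream records carry a referrer URL and a clicked URL (the toolbar's real wire format is not public, so all names here are hypothetical):

```python
# Hypothetical sketch: grow an empty index from (query, clicked-url)
# pairs mined out of a browsing log. Formats and field names are assumed.
from collections import Counter, defaultdict
from urllib.parse import parse_qs, urlparse

index: dict[str, Counter] = defaultdict(Counter)

def observe(referrer: str, clicked: str) -> None:
    """Record a click that followed a search results page."""
    query = parse_qs(urlparse(referrer).query).get("q")
    if query:
        index[query[0]][clicked] += 1

observe("https://www.google.com/search?q=lcd+tv", "https://example.com/best-tvs")
observe("https://www.google.com/search?q=lcd+tv", "https://example.com/best-tvs")

# Ranking by click count now reproduces Google's preferred result.
best = index["lcd tv"].most_common(1)[0][0]
assert best == "https://example.com/best-tvs"
```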
Matt's assertion about "Suggested Sites" sending this data seems to be conjecture. I ran some packet captures and didn't find anything of that kind.
However, if you install the Bing Toolbar then it does send URL clickstream data. It explicitly asks you beforehand if you want to send info about "the searches you do, websites you visit..." though.
Excellent analysis, thank you, especially in the distinction between "suggested sites" and the Bing toolbar's behavior. I can't argue with your methods. However, I do differ with some of your conclusions:
> The behaviour I’ve seen explains Google’s experiments, but does not support the accusation that Bing set out to copy Google.
I don't think it's so much about "set out to copy Google" necessarily, as it is that they are explicitly parsing Google queries and results from the clickthrough data and using it (quite directly) for their own results. What they set out to do is immaterial given what is provably happening.
> Bing Toolbar is tracking user clicks and Bing could use the result to improve search results. I don’t personally see any great distinction between this behaviour and Google’s many tracking, indexing and scraping endeavours which they use to improve their own search results.
The difference is that Google has proven that the results of certain queries are being directly fed as Bing results. If Microsoft does the same with Google rankings, I'd see your point, but right now the evidence only points in one direction.
> While I personally dislike the privacy implications, Bing Toolbar is pretty upfront about it when it gets installed (unlike much web page user tracking.) The fact that the tracking is plain HTTP not HTTPS, with the content in plaintext, would seem to indicate that they weren’t seeking to hide anything.
I'd be interested to see if Google over HTTPS queries are being transmitted by the toolbar over HTTP. That would be a pretty serious privacy violation IMO, especially when you pair that with unencrypted wifi at Starbucks. See the AOL search log fiasco: http://www.somethingawful.com/d/weekend-web/aol-search-log.p...
Thank you. And this is precisely the problem with this whole fiasco -- it's just a bunch of conjecture and assertions. But because it's from Google it's taken with much more weight than it would be otherwise.
Google has created precisely the conversation they wanted based on their assertions then gets indignant when Microsoft refuses to step into it.
The article in the WSJ insinuates that it can come from either IE or the toolbar:
> Stefan Weitz, director of the Bing search engine at Microsoft, said in an interview the company studies how certain users interact with Google in order to improve Bing. It does this by looking at "clickstream data," or information that users of Microsoft's Internet Explorer or the Bing search toolbar voluntarily share with the company.
Fascinating. I was planning on doing the same thing and now I don't have to -- thank you for taking the time. As you mention, this stuff could all be happening server-side, of course.
Oh man. That research paper he quoted strongly indicates that MS is specifically targeting Google.
I was totally wrong. Bing are copying Google. I'm sorry (I am the "What on earth are Google thinking" author).
I still think the honeypot experiments didn't support the conclusion. But this paper coupled with Bing's lacklustre pseudo-denial strongly indicates that my view of events was not accurate and that Bing were indeed blatantly piggybacking off Google's hard work.
I'm really disappointed. I gave Bing the benefit of the doubt, saw that there was another conclusion which explained Google's observations and I advocated that. But now it really looks like MS overstepped the mark and were deliberately picking out Google URLs in order to use signals from Google's algorithms to improve Bing.
That really is cheating.
edit: although still it's not conclusive. I don't know what to think now. I think I should stick to writing code :)
In the spirit of completeness, the paper says they use Google, Bing, and Yahoo.
But the paper doesn't say this is what Bing uses. Looks like it, but we should be clear. This is MSR, not Bing.
Good paper though. But it is another example of MS actually not thinking it's bad. They published a paper on this where they spell out how MSR would do this. Did people in the academic community protest that this would be copying if implemented?
Those examples to me are like finding a maggot in a piece of meat. Maybe that's the only one, and if you just remove it the rest is ok, but it's going to make you pretty suspicious especially if it already smells.
The reality is that it's almost impossible for google to measure how much their results influence bing. After all, they are both trying to rank the same internet. The best they can do is prove the obvious cases and let bing do the explaining.
user24, I didn't make the connection that you did those "What on earth are Google thinking" posts--just wanted to say thanks for your tweet earlier. It made me feel like doing the post wasn't such a bad idea.
I haven't got much to add since I've expressed my views previously, plus this blog post does a good job of calmly pointing out the issues. It's worth emphasising the penultimate paragraph from the blog post since I feel it's key here though:
"Since people at Microsoft might not like this post, I want to reiterate that I know the people (especially the engineers) at Bing work incredibly hard to compete with Google, and I have huge respect for that. It's because of how hard those engineers work that I think Microsoft should stop using clicks on Google in Bing's rankings. If Bing does better on a search query than Google does, that's fantastic. But an asterisk that says "we don't know how much of this win came from Google" does a disservice to everyone. I think Bing's engineers deserve to know that when they beat Google on a query, it's due entirely to their hard work. Unless Microsoft changes its practices, there will always be a question mark."
I agree with this entirely. Given Bing's resources, it seems bizarre that they were relying on Google's results even a little bit. And until they sort this issue out, I definitely agree that it's difficult to know when to give credit to Bing.
(Plus it's made all the more difficult considering that Bing keep sort of denying this whole mess, despite the quite conclusive proof. I suspect that if they do stop using Google's results, we won't know about it considering the hole that Bing have dug themselves with their denials. Ah well, que sera sera).
Matt Cutts should know better. Being part of '1000 signals' does not mean all signals are weighted evenly. It does not even mean the signals are weighted the same across all query types. This is machine learning - the actual weighting is learned and dynamic (always shifting) and not controlled. And there is absolutely no reason for Microsoft to take out a particular signal just because Google asked. There needs to be proof of unethical behavior, of which there is none.
The Chrome and Gmail EULAs are as bad as the IE8 Suggested Sites bit mentioned in this article. Gmail basically reads your email to provide contextually relevant ads. Does my mom know that? No.
Not a plug, but I blogged about this @ http://roshank.posterous.com/google-versus-bing-no-one-is-be... . I believe this should be a discussion on ethics - and feel it is ethical for a company to do whatever it wants with data contained in its own software application.
I don't think you read the whole article, because Matt quotes Nate Silver saying that exact thing.
You said: " -Being part of '1000 signals' does not mean all signals are weighted evenly."
Matt quotes Nate: "First, not all of the inputs are necessarily equal. It could be, for instance, that the Google results are weighted so heavily that they are as important as the other 999 inputs combined."
Google should also quit trying to build the invite-your-facebook-friends feature in order to get virality for their services. Because, you know, it does a big disservice to their engineers and marketing people that they must rely on facebook.
And if they are not relying on facebook, why not just get rid of it altogether?
"Because it helps the users".
Oh I see. But so does Bing's actions.
"Because its individual users giving permission"
Same with Bing.
I think Matt lays out a strong argument for why they have made such a stink. And it sounds like a pretty compelling defense for their actions to go public.
But...
1) I think they will regret it big time when all the attention they are causing makes bored government officials poke their heads in and realize that, at a macro level, neither Bing nor Google does anything to protect user search privacy.
2) I think Google has more to lose by bringing this to light than they have to win. Despite Matt's defense, it's hard to see it as anything other than being petty and pedantic. But this is their response to anything that threatens their search pageviews, and that's understandable even if erroneous.
They should focus on trying to be innovative again, that was the Google I respected.
> They should focus on trying to be innovative again, that was the Google I respected.
I work in search quality at Google. I'm busting my ass every day working on fundamental reimaginings of how results get ranked. I'm going to keep doing that regardless of what Bing does, because it makes the world a better place, and it's fun.
But, suppose the stuff I'm working on works out, and tomorrow Google shows up with a wholly new set of awesome results. This is very possible, there is a ton of headroom left in search quality, I've seen the experiments myself.
Then after a few weeks of sniffing clicks, Bing comes up with the same set of revolutionary results, but they have no idea how they got there, they have no idea what the evidence is to rank them there, all they know is that people like those results on Google. Is this fair? Is this ethical? Is this even legal?
People in this whole debate have the idea that the user is creating this association between the result and the query, as if the user searched through the whole web and came back saying "hey Microsoft, check out this great result for this query! I found it! Isn't it awesome?" They aren't. Users click on whatever results you put in front of them, generally starting with the top and working their way down.
Ranking results is not a science with some objective optimal conclusion. Ranking results is fundamentally subjective, and while data-driven, is ultimately an opinion. The user does have some discriminating power in this whole feedback loop, but it's minuscule compared to Google figuring out how to show them that result in the first place. Bing is taking the closest proxy that they can practically acquire for Google's opinion, and using it directly in their ranking.
I'm certain they never stopped trying to be innovative, but Bing is able to use that click stream data to reap the benefits of Google's innovations. Google is on the right side of this argument even if it makes them seem whiny to some people.
> 1) ...neither Bing nor Google does anything to protect user search privacy.
Do you mean external or internal privacy? In terms of leaking information, SSL for search is about as good as you're going to get for privacy... if it's the browser or an extension (toolbar) that's watching searches, there's nothing a web site can do.
> 2) I think Google has more to lose by bringing this to light than they have to win.
That may be true, but there was a pretty cutting Colbert segment on this last night where nuances about clickstreams weren't really a concern. Personally, I think Google should have taken the humour route in the first place, as the "smoking gun" isn't all that damning at first sight. It requires some thought, which leaves plenty of room for disagreement and doubt.
Google has something to gain from all this. Consider the implications to Bing/Google competition for marketshare in the following scenario:
MS decides to bundle and auto-install the Bing toolbar in IE, with default opt-in to the "share clickstream data with MS" option. Now you have tons of users with the Bing toolbar _and_ using Google to search. MS could "use" or "copy" the results of Google's hard work on search and PageRank, and consequently provide some serious competition for search marketshare.
I found the discussion video on that page fascinating, in how Harry Shum's take on the other topics echoed his defence of Bing on this issue.
On the copying: Bing has dramatically improved over recent history because we use lots of inputs and we would have preferred if Google had talked to us privately so we could figure out how to make this less obvious on the long tail.
On web spam: Google is the industry leader and needs to share more with us little guys so we can all work together to beat this.
On search quality: Google needs to disclose their quality metrics so the industry as a whole can understand what users want. Then we can all make search better for everyone.
I thought Cutts was very gracious in his responses despite looking incredulous at hearing some of this stuff.
Look, this "let's all get along and work together as an industry to fix problems for the user"... it's bullshit. Either compete fairly or don't, but don't pretend that Google owes you data so you can get a leg up.
That shows a lot of bad faith on the part of Matt Cutts. It has been explained to death, here and elsewhere, what most likely happened. The fact that he continues to make the same accusation... well frankly he lost a lot of credibility with me. One more manipulative corporate drone, one less genuine hacker.
This is, by far, the best treatment of the issue I've seen anywhere, on account of how it's the only level-headed one apart from Nate Silver's. There's a little needless inflation at the beginning with the screenshot comparison. It does have the obvious and expected slant. On the other hand, there's no enormous hyperbole or anything like that. There are no vacuous statements. The word "copying" appears a few times, but that's at least descriptive, and there's no use of other loaded terms like "cheating", "stealing", and "unethical".
Also, for what it's worth, it moved me from 80/20 certain that nothing fishy is going on to 50/50: the spell correction paper shows to a certainty that this has been considered. I'm sure Googlers know as well as I do that not everything in a research paper finds its way into the product, and this one in particular may have been nixed by management types for this very reason. It's still awfully suspicious.
I still think the original experiment completely fails to demonstrate anything unethical, and I still think the original info release was both hyperbolic and needlessly inflammatory. It does demonstrate a need for some more information, which seems to be all this post is asking for. If it had looked more like this post, I think the 'net could have been spared a lot of controversy. Maybe Matt Cutts should be writing these things, though far be it from me to decide that.
I'm shocked at the amount of support on HN (thus far) for Bing's attitude on this issue.
I understand that they may well not target Google's SERPs specifically in their clickstream analysis but they should certainly have excluded Google from it, for ethical reasons.
Google state that Bing created associations from clickstreams through Google's SERPs on common queries (e.g. the tarsorrhaphy spell check test), not just long tail queries. Given that Google is extremely popular, this must have given a lot of weight to clickstream signals resulting from Google SERPs on many occasions, for common queries. That is entirely unethical and I'm shocked so many here don't find a problem with this.
Take the case of a highly-ranked great result on Google for a particular term, which Bing rates lower due to inferior algorithms. Bing's analysis of Google users would send that result higher in the Bing SERPs, mainly due to Google's expertise in highlighting that site, and only to a small degree due to the user's choice of clicking on it. That to me does fall under "copying" Google's results; it may not be illegal, or intentional, but copying it remains.
Bing should have excluded Google from their clickstreams, and I certainly hope Google exclude Bing from theirs. (Matt Cutts stated they do in the video.)
Let's see. You're ok with Bing doing click analysis, but you want them to specifically throw the clicks away when the user stopped by Google, because it would be unethical. And also probably remove clickstreams that involve Yahoo's directory. And StackOverflow answers, because it would be cheating. And blogs referencing interesting links. And... or just accept that clickstream analysis is a great source of data that will be an aggregation of many, many other things on the Internet. Including Google.
A quick thought experiment on this subject: say I search for a term on Bing, find a link I want and then put that link on my blog - when Google indexes that blog have they copied Bing?
Sure its different, but is it meaningfully different? I made the link between the two terms, I also consented for that data to be used in both cases (assuming the data comes from the Bing toolbar and an agreeable robots.txt). I just don't see how the data that Microsoft is using is off limits.
Just putting the link on your blog doesn't establish the connection between the keywords and the link. If you put both the keywords and the link on your blog then it allows Google to independently establish that connection using the merits of their own algorithms.
The "incriminating" part here is that Microsoft appears to be intentionally parsing the keywords out of the search. Which means they are intentionally looking at a click and saying a) this is a Google search and here are the terms used, and then b) this is a result that Google returned, and they are then using that to fill their own index.

If they generically parsed the URL for terms then you might argue they are not giving Google special treatment, they are just doing this for every page the user goes to. However, that's a bit hard to buy; if they really did that they would end up with all kinds of garbage associations from opaque URLs. So they must have a signal saying "this was a Google search, treat it better than the others". Or they are somewhere in between the two.

It's not clear to me where they fall on this scale; it's generally murky. At worst, I'd say they are copying; at best, I'd say it's sneaky but clever and fair game. The minute they single out Google and say "hey, this must be a good result", I think they crossed a line.
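To make that distinction concrete, here is a hypothetical contrast between generic URL tokenization and search-aware parsing. The host list and both functions are invented for illustration; nothing here reflects Bing's actual pipeline.

```python
# Hypothetical contrast: naive URL tokenization vs. search-aware parsing.
# SEARCH_HOSTS and both functions are invented for illustration.
import re
from urllib.parse import parse_qs, urlparse

SEARCH_HOSTS = {"www.google.com", "www.bing.com", "search.yahoo.com"}

def generic_terms(url: str) -> list[str]:
    """Naively tokenize the whole URL: yields garbage on opaque URLs."""
    return re.findall(r"[a-z]{3,}", url.lower())

def search_terms(url: str) -> list[str]:
    """Only trust the q= parameter of a known search engine's results page."""
    parsed = urlparse(url)
    if parsed.netloc in SEARCH_HOSTS:
        query = parse_qs(parsed.query).get("q")
        if query:
            return query[0].split()
    return []

opaque = "https://example.com/p?id=8f3a&session=abcdef123"
assert "session" in generic_terms(opaque)  # noise, not a real keyword
assert search_terms(opaque) == []
assert search_terms("https://www.google.com/search?q=lcd+tv+reviews") == [
    "lcd", "tv", "reviews"]
```

The garbage that generic tokenization produces on opaque URLs is what makes the "no special treatment" explanation hard to buy.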
I still don't understand why Google hasn't done the exact same test with a control variable. Run the exact same test again on a domain that isn't google.com.
Most of my thoughts have already been echoed elsewhere in this thread, but:
> "To me, what the experiment proved was that clicks on Google are being incorporated in Bing’s rankings."
It proved that clicks on other websites are being incorporated in Bing's rankings, which had already been public knowledge, I think. It didn't prove that only or disproportionately clicks on Google are being thus incorporated, although that is what Google is repeatedly claiming.
> "If clicks on Google really account for only 1/1000th (or some other trivial fraction) of Microsoft’s relevancy, why not just stop using those clicks and reduce the negative coverage and perception of this?"
Is Matt Cutts suggesting that Bing special-case an exclusion for Google results?
The technical aspect is plain. Copying Google is a great idea. If you're trying to predict the weather and are given a general bag of indicators, the one that is a well-reasoned expert opinion of the weather is probably a very important source. It may even account for a majority of your own predictive power. It would be stupid to ignore it if your goal is to simply make the best prediction.
Likewise, aggressively scraping Google is a smart move. Then you add some new innovation atop it and have a real opportunity to return more informed responses. This is done all the time in science.
In some sense Google simply has to acknowledge that they are a pretty important segment of the web, not some separate entity from it.
So only the legal/ethical question remains. In science it's unethical to work atop someone else's project without crediting them. I doubt Bing would be interested in adding a "Powered by Google" bar. Moreover, since Bing could directly profit off Google's work, undercutting actual algorithmic progress through pure marketing competition (hypothetically, anyway; I am sure that Bing has added tech too), I feel like it's better to restrict this sort of thing.
I think it's fair to say that much like commercial images on Flickr or sample songs, it is unethical and illegal to copy digital services and goods then either claim them your own or profit off of them. I think Google results are suitably close in spirit to this.
So maybe Bing and Girl Talk need to team up and discover and defend the ethical rights of sampling digital goods.
I totally agree with this. There's not really any good reason why Microsoft should piggyback off of Google's search results. The fact that they're able to observe Google's results by observing user clicks in a browser rather than by simply harvesting results directly off of Google's website confuses the matter, but ultimately it's irrelevant. This is just a sneaky way for Microsoft to observe a user's interaction with an information service (Google), to see what information the user obtained, and then offer the same information themselves.
Seems like Bing is having trouble explaining exactly what they are doing, and Google is having trouble explaining why they should stop.
Is it illegal? If not, why should Bing stop doing it?
> I think Bing’s engineers deserve to know that when they beat Google on a query, it’s due entirely to their hard work. Unless Microsoft changes its practices, there will always be a question mark.
Kind of rings hollow to me. If I were Bing, I'd want to do what's best for my users.
Well, the whole thing's kinda meta, isn't it? The content isn't on either site, it's just, really, the content identifiers. Not even that, this is about a sorting of identifiers against arbitrary strings of text. Whatever distinctions we have here are gonna appear incredibly small (and petty), but it's really the bread and butter of search.
Both are map makers in a sense, providing guides to what they didn't create, but still spent plenty of effort to make that guide. In the literal, map-making world, the artifact is the printed map. In that world, the way to check for copyright infringement is to see if mapping errors are duplicated as well.
The presumption being that if the errors were copied, then so was the good data.
Short-term, the benefit was that users would get discounts on high-quality data, as they only had to pay for the efforts of the map-copy, not the original map data acquisition. Of course, then you're just waiting for the quality to drop, as there's less and less incentive to actually do the map-making work. The margins go down, and the original sources have to update their maps less often to keep their costs low enough to be competitive.
There isn't a printed page with web search; the product is the output of a continuously-running dataset & algorithm.
But, I'm gonna ask, in the web-search world, how do you define copying and how do you test for it? If you don't think there is a valid definition, please don't count yourself the same as the group who thinks that there is a valid definition and this isn't it. They're two separate things.
(I've framed the question how I see it, and I work for Google, but I'm obviously no official speaker -- I've only been here a few months, and don't work in search quality. This is (almost definitionally) a fanboi war of sorts, and I wanted to stay out. I probably should have :( )
Google should do more testing. For example: repeat the same tests (different nonsense search terms and results just to be sure) where none of the data sent to/from the browser contains any google specific terminology. If the Bing toolbar isn't special-casing google then I think the test results should be the same.
[+] [-] kenjackson|15 years ago|reply
My take on this is quite the opposite. If MS thinks they're doing nothing wrong, they shouldn't stop. They should do what is best for their customers.
Reversing what they do now because Google "caught" them, would IMO, imply they were doing something wrong and got caught.
I personally think what Bing is doing is great. I regularly use Bing and Google. I wish there were a streamlined way for me to broadcast to both of them... "for search term X this is the relevant link!"
MS has found a way to do this, and as long as its opt-in, I like it. I wish Google had the same (although I do realize that since Google owns so much of the traffic it is less of an issue for them -- its a net loss in terms of the flow of information).
But maybe a question for Matt... Can you and MS work together for a toolbar that does just this for both engines?
As a user, this isn't a matter of Bing copying Google or not. It's about the fact that search relevance still kind of sucks. I feel like I can make the search experience better, especially on those occasions when I go to your competitor's site.
[+] [-] pessimist|15 years ago|reply
IMHO, there's a loophole in Bing's argument that the user click data is free for them to use. What about the site - do they have no ability to consent/dissent to being tracked?
Maybe robots.txt needs to be updated for the toolbar.
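A purely hypothetical sketch of what that might look like -- the `Disallow-clickstream` directive below does not exist in the robots.txt standard or any real crawler/toolbar; it's just an illustration of a site-level opt-out:

```
# Hypothetical extension -- not part of the robots.txt standard
User-agent: BingToolbar
Disallow-clickstream: /
```

The open question is whether a toolbar running on the user's machine would feel bound to honor a site-level directive the way a crawler does.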
[+] [-] gregable|15 years ago|reply
The broadcast clickstream data is the product of two factors:
- the user's preference
- Google's ranking algorithm.
Had Google not ranked that URL, the user wouldn't be broadcasting it. If there was some way to extract the factor of the user's preference of URLs as a signal without the factor of Google's ranking algorithm, you would be more spot on. Unfortunately this isn't possible. Also, in most cases the ranking algorithm component of this broadcasted data is the more useful factor of the two.
[+] [-] angusgr|15 years ago|reply
All of the negative perception seems to come from Google deliberately inserting outlier data. For MS to back down on that basis immediately makes them look like they were doing much more than what is proven.
[+] [-] jsnell|15 years ago|reply
What if it slows down innovation in search quality and shifts effort into marketing and shiny baubles? After all, why bother spending resources on researching, implementing and evaluating the algorithms when your competitors will gain most of the improvements within a couple of months, automatically and with no effort?
You complain about search relevance sucking. Why are you so eager to support a practice that has a high chance of causing it to stagnate?
[+] [-] haberman|15 years ago|reply
Does this mean anyone should feel free to ignore robots.txt if they think it will be better for their customers?
[+] [-] ryanhuff|15 years ago|reply
[+] [-] ddkrone|15 years ago|reply
[+] [-] webwright|15 years ago|reply
MS clearly uses toolbar users' clickstreams (on and off Google) to improve their own search efforts. Google created an artificial scenario where the ONLY input was Google search behavior and lo, the search results are exactly the same. Whether or not that steps over a line (I don't feel that it does), it's not "copying" in my book.
Another interesting point is that Google has been beating the "open" drum for a long time now. No walled gardens, right? If a Facebook user should have the right to take his data with him wherever he wants to go, shouldn't a Google user be able to fork over their behavior data to Bing?
Matt's point about MS' lack of clarity when getting folks' permission to grab their clickstream was dead-on. THAT is pretty outrageous and MS should be ashamed of it.
Regardless of all that, hats off to Matt for keeping a cool head and stating his position in a respectful way.
[+] [-] boredguy8|15 years ago|reply
You'd have to make the argument that using this data is wrong, somehow. But to make that argument, you'd basically have to argue that users shouldn't be able to share their habits with whomever they want to. I doubt that argument can be made in a compelling way.
I'm surprised Google is pushing this further, and a little disappointed.
[+] [-] jimbokun|15 years ago|reply
Cutts addresses this:
"As we said in our blog post, the whole reason we ran this test was because we thought this practice was happening for lots and lots of different queries, not simply rare queries."
[+] [-] onecommentayear|15 years ago|reply
[+] [-] angusgr|15 years ago|reply
However, if you install the Bing Toolbar then it does send URL clickstream data. It explicitly asks you beforehand if you want to send info about "the searches you do, websites you visit..." though.
Full post: http://projectgus.com/2011/02/bing-google-finding-some-facts...
[+] [-] mayank|15 years ago|reply
> The behaviour I’ve seen explains Google’s experiments, but does not support the accusation that Bing set out to copy Google.
I don't think it's so much about "set out to copy Google" necessarily, as it is that they are explicitly parsing Google queries and results from the clickthrough data and using it (quite directly) for their own results. What they set out to do is immaterial given what is provably happening.
> Bing Toolbar is tracking user clicks and Bing could use the result to improve search results. I don’t personally see any great distinction between this behaviour and Google’s many tracking, indexing and scraping endeavours which they use to improve their own search results.
The difference is that Google has proven that the results of certain queries are being directly fed as Bing results. If Microsoft does the same with Google rankings, I'd see your point, but right now the evidence only points in one direction.
> While I personally dislike the privacy implications, Bing Toolbar is pretty upfront about it when it gets installed (unlike much web page user tracking.) The fact that the tracking is plain HTTP not HTTPS, with the content in plaintext, would seem to indicate that they weren’t seeking to hide anything.
I'd be interested to see if Google over HTTPS queries are being transmitted by the toolbar over HTTP. That would be a pretty serious privacy violation IMO, especially when you pair that with unencrypted wifi at Starbucks. See the AOL search log fiasco: http://www.somethingawful.com/d/weekend-web/aol-search-log.p...
[+] [-] hackinthebochs|15 years ago|reply
Google has created precisely the conversation they wanted based on their assertions, then gets indignant when Microsoft refuses to step into it.
[+] [-] stumm|15 years ago|reply
Stefan Weitz, director of the Bing search engine at Microsoft, said in an interview that the company studies how certain users interact with Google in order to improve Bing. It does this by looking at "clickstream data," or information that users of Microsoft's Internet Explorer or the Bing search toolbar voluntarily share with the company.
source: http://online.wsj.com/article/SB1000142405274870412450457611...
[+] [-] wzdd|15 years ago|reply
[+] [-] user24|15 years ago|reply
I was totally wrong. Bing are copying Google. I'm sorry (I am the "What on earth are Google thinking" author).
I still think the honeypot experiments didn't support the conclusion. But this paper coupled with Bing's lacklustre pseudo-denial strongly indicates that my view of events was not accurate and that Bing were indeed blatantly piggybacking off Google's hard work.
I'm really disappointed. I gave Bing the benefit of the doubt, saw that there was another conclusion which explained Google's observations and I advocated that. But now it really looks like MS overstepped the mark and were deliberately picking out Google URLs in order to use signals from Google's algorithms to improve Bing.
That really is cheating.
edit: although still it's not conclusive. I don't know what to think now. I think I should stick to writing code :)
[+] [-] kenjackson|15 years ago|reply
But the paper doesn't say this is what Bing uses. Looks like it, but we should be clear. This is MSR, not Bing.
Good paper, though. But it is another example of MS actually not thinking it's bad. They published a paper where they spell out how MSR would do this. Did people in the academic community protest that this would be copying if implemented?
[+] [-] moultano|15 years ago|reply
The reality is that it's almost impossible for Google to measure how much their results influence Bing. After all, they are both trying to rank the same internet. The best they can do is prove the obvious cases and let Bing do the explaining.
[+] [-] Matt_Cutts|15 years ago|reply
[+] [-] tristanperry|15 years ago|reply
"Since people at Microsoft might not like this post, I want to reiterate that I know the people (especially the engineers) at Bing work incredibly hard to compete with Google, and I have huge respect for that. It's because of how hard those engineers work that I think Microsoft should stop using clicks on Google in Bing's rankings. If Bing does better on a search query than Google does, that’s fantastic. But an asterisk that says "we don't know how much of this win came from Google" does a disservice to everyone. I think Bing's engineers deserve to know that when they beat Google on a query, it's due entirely to their hard work. Unless Microsoft changes its practices, there will always be a question mark"
I agree with this entirely. Given Bing's resources, it seems bizarre that they were relying on Google's results even a little bit. And until they sort this issue out, I definitely agree that it's difficult to know when to give credit to Bing.
(Plus it's made all the more difficult considering that Bing keep sort of denying this whole mess, despite the quite conclusive proof. I suspect that if they do stop using Google's results, we won't know about it considering the hole that Bing have dug themselves with their denials. Ah well, que sera sera).
[+] [-] whiletruefork|15 years ago|reply
The Chrome and Gmail EULAs are as bad as the IE8 Suggested Sites bit mentioned in this article. Gmail basically reads your email to provide contextually relevant ads. Does my mom know that? No.
Not a plug, but I blogged about this @ http://roshank.posterous.com/google-versus-bing-no-one-is-be... . I believe this should be a discussion on ethics - and feel it is ethical for a company to do whatever it wants with data contained in its own software application.
[+] [-] andrewljohnson|15 years ago|reply
You said: " -Being part of '1000 signals' does not mean all signals are weighted evenly."
Matt quotes Nate: "First, not all of the inputs are necessarily equal. It could be, for instance, that the Google results are weighted so heavily that they are as important as the other 999 inputs combined."
[+] [-] zaidf|15 years ago|reply
And if they are not relying on Facebook, why not just get rid of it altogether?
"Because it helps the users." Oh, I see. But so do Bing's actions.
"Because it's individual users giving permission." Same with Bing.
[+] [-] davidu|15 years ago|reply
But...
1) I think they will regret it bigtime when all the attention they are causing makes the bored government officials poke their head in and realize that at a macro level, neither Bing nor Google does anything to protect user search privacy.
2) I think Google has more to lose by bringing this to light than they have to win. Despite Matt's defense, it's hard to see it as anything other than petty and pedantic. But this is their response to anything that threatens their search pageviews, and that's understandable even if erroneous.
They should focus on trying to be innovative again, that was the Google I respected.
[+] [-] moultano|15 years ago|reply
I work in search quality at Google. I'm busting my ass every day working on fundamental reimaginings of how results get ranked. I'm going to keep doing that regardless of what bing does, because it makes the world a better place, and it's fun.
But, suppose the stuff I'm working on works out, and tomorrow Google shows up with a wholly new set of awesome results. This is very possible, there is a ton of headroom left in search quality, I've seen the experiments myself.
Then after a few weeks of sniffing clicks, Bing comes up with the same set of revolutionary results, but they have no idea how they got there, they have no idea what the evidence is to rank them there, all they know is that people like those results on Google. Is this fair? Is this ethical? Is this even legal?
People in this whole debate have the idea that the user is creating this association between the result and the query, like the user searched through the whole web and came back saying "hey Microsoft, check out this great result for this query! I found it! Isn't it awesome?" They aren't. Users click on whatever results you put in front of them, generally starting at the top and working their way down.
Ranking results is not a science with some objective optimal conclusion. Ranking results is fundamentally subjective, and while data-driven, is ultimately an opinion. The user does have some discriminating power in this whole feedback loop, but it's minuscule compared to Google figuring out how to show them that result in the first place. Bing is taking the closest proxy that they can practically acquire for Google's opinion, and using it directly in their ranking.
[+] [-] coderdude|15 years ago|reply
[+] [-] magicalist|15 years ago|reply
Do you mean external or internal privacy? In terms of leaking information, SSL for search is about as good as you're going to get for privacy... if it's the browser or an extension (toolbar) that's watching searches, there's nothing a web site can do.
> 2) I think Google has more to lose by bringing this to light than they have to win.
That may be true, but there was a pretty cutting Colbert segment on this last night where nuances about clickstreams weren't really a concern. Personally, I think Google should have taken the humour route in the first place, as the "smoking gun" isn't all that damning at first sight. It requires some thought, which leaves plenty of room for disagreement and doubt.
[+] [-] spaghetti|15 years ago|reply
MS decides to bundle and auto-install the Bing toolbar in IE, and also default to opt-in for the "share clickstream data with MS" option. Now you have tons of users with the Bing toolbar _and_ using Google to search. MS could "use" or "copy" the results of Google's hard work on search and PageRank and consequently provide some serious competition for search marketshare.
[+] [-] noibl|15 years ago|reply
On the copying: Bing has dramatically improved over recent history because we use lots of inputs and we would have preferred if Google had talked to us privately so we could figure out how to make this less obvious on the long tail.
On web spam: Google is the industry leader and needs to share more with us little guys so we can all work together to beat this.
On search quality: Google needs to disclose their quality metrics so the industry as a whole can understand what users want. Then we can all make search better for everyone.
I thought Cutts was very gracious in his responses despite looking incredulous at hearing some of this stuff.
Look, this "let's all get along and work together as an industry to fix problems for the user"... it's bullshit. Either compete fairly or don't, but don't pretend that Google owes you data so you can get a leg up.
[+] [-] KevBurnsJr|15 years ago|reply
5 minutes of hearing Bing's VP of Search talk and all I hear is distancing, deflection and double speak.
Bing does copy Google's results indirectly through click data gathered by the Bing toolbar.
[+] [-] alain94040|15 years ago|reply
[+] [-] Matt_Cutts|15 years ago|reply
[+] [-] amalcon|15 years ago|reply
This is, by far, the best treatment of the issue I've seen anywhere, on account of how it's the only level-headed one apart from Nate Silver's. There's a little needless inflation at the beginning with the screenshot comparison. It does have the obvious and expected slant. On the other hand, there's no enormous hyperbole or anything like that. There are no vacuous statements. The word "copying" appears a few times, but that's at least descriptive, and there's no use of other loaded terms like "cheating", "stealing", and "unethical".
Also, for what it's worth, it moved me from 80/20 certain that nothing fishy is going on to 50/50: the spell correction paper shows to a certainty that this has been considered. I'm sure Googlers know as well as I do that not everything in a research paper finds its way into the product, and this one in particular may have been nixed by management types for this very reason. It's still awfully suspicious.
I still think the original experiment completely fails to demonstrate anything unethical, and I still think the original info release was both hyperbolic and needlessly inflammatory. It does demonstrate a need for some more information, which seems to be all this post is asking for. If it had looked more like this post, I think the 'net could have been spared a lot of controversy. Maybe Matt Cutts should be writing these things, though far be it from me to decide that.
[+] [-] lamdanman|15 years ago|reply
I understand that they may well not target Google's SERPs specifically in their clickstream analysis but they should certainly have excluded Google from it, for ethical reasons.
Google state that Bing created associations from clickstreams through Google's SERPs on common queries (e.g. the tarsorrhaphy spell check test), not just long tail queries. Given that Google is extremely popular, this must have given a lot of weight to clickstream signals resulting from Google SERPs on many occasions, for common queries. That is entirely unethical and I'm shocked so many here don't find a problem with this.
Take the case of a highly-ranked great result on Google for a particular term, which Bing rates lower due to inferior algorithms. Bing's analysis of Google users would send that result higher in the Bing SERPs, mainly due to Google's expertise in highlighting that site, and only to a small degree due to the user's choice of clicking on it. That, to me, does fall under "copying" Google's results; it may not be illegal, or intentional, but copying it remains.
Bing should have excluded Google from their clickstreams, and I certainly hope Google excludes Bing from theirs. (Matt Cutts stated in the video that they do.)
[+] [-] alain94040|15 years ago|reply
[+] [-] greendestiny|15 years ago|reply
Sure, it's different, but is it meaningfully different? I made the link between the two terms, and I also consented for that data to be used in both cases (assuming the data comes from the Bing toolbar and an agreeable robots.txt). I just don't see how the data that Microsoft is using is off limits.
[+] [-] zmmmmm|15 years ago|reply
The "incriminating" part here is that Microsoft appears to be intentionally parsing the keywords out of the search. Which means they are intentionally looking at a click and saying a) this is a Google search and here are the terms used, and then b) this is a result that Google returned, and they are then using that to fill their own index. If they generically parsed the URL for terms, you might argue they are not giving Google special treatment, they are just doing this for every page the user goes to. However, that's a bit hard to buy - if they really did that, they would end up with all kinds of garbage associations from opaque URLs. So they must have a signal saying "this was a Google search, treat it better than the others". Or they are somewhere in between the two. It's not clear to me where they fall on this scale - it's generally murky. At worst, I'd say they are copying; at best, I'd say it's sneaky but clever and fair game. The minute they single out Google and say "hey, this must be a good result", I think they crossed a line.
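To make the distinction concrete, here is a minimal sketch (in Python; the function name and logic are purely illustrative, not Microsoft's actual code) of what special-casing a Google search URL in clickstream processing might look like, as opposed to generically scraping parameters from every URL:

```python
from urllib.parse import urlparse, parse_qs

def extract_google_query(url):
    """Return the search terms if `url` looks like a Google results page,
    otherwise None. Other sites' URL parameters are ignored -- which is
    exactly the special-casing being debated."""
    parsed = urlparse(url)
    if "google." not in parsed.netloc:
        return None
    terms = parse_qs(parsed.query).get("q")
    return terms[0] if terms else None

# A clickstream pair (search page visited, result then clicked) could be
# turned into a (query, clicked-url) association for the index:
print(extract_google_query("https://www.google.com/search?q=tarsorrhaphy"))  # tarsorrhaphy
print(extract_google_query("https://example.com/find?q=tarsorrhaphy"))       # None
```

A generic version would drop the `"google." in parsed.netloc` check and accept `q=` parameters from any site, which, as noted above, would pull in garbage associations from opaque URLs.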
[+] [-] _flag|15 years ago|reply
[+] [-] blahedo|15 years ago|reply
[+] [-] blahedo|15 years ago|reply
> "To me, what the experiment proved was that clicks on Google are being incorporated in Bing’s rankings."
It proved that clicks on other websites are being incorporated in Bing's rankings, which had already been public knowledge, I think. It didn't prove that only or disproportionately clicks on Google are being thus incorporated, although that is what Google is repeatedly claiming.
> "If clicks on Google really account for only 1/1000th (or some other trivial fraction) of Microsoft’s relevancy, why not just stop using those clicks and reduce the negative coverage and perception of this?"
Is Matt Cutts suggesting that Bing special-case an exclusion for Google results?
[+] [-] pkamb|15 years ago|reply
[+] [-] tel|15 years ago|reply
Likewise, aggressively scraping Google is a smart move. Then you add some new innovation atop it and have a real opportunity to return more informed responses. This is done all the time in science.
In some sense Google simply has to acknowledge that they are a pretty important segment of the web, not some separate entity from it.
So only the legal/ethical question remains. In science it's unethical to work atop someone else's project without crediting them. I doubt Bing would be interested in adding a Powered by Google bar. Moreover, since Bing could directly profit off Google's work, undercutting actual algorithmic progress through pure marketing competition (hypothetically, anyway; I am sure that Bing has added tech too), I feel like it's better to restrict this sort of thing.
I think it's fair to say that, much like commercial images on Flickr or sample songs, it is unethical and illegal to copy digital services and goods and then either claim them as your own or profit off of them. I think Google results are suitably close in spirit to this.
So maybe Bing and Girl Talk need to team up and discover and defend the ethical rights of sampling digital goods.
[+] [-] o_nate|15 years ago|reply
[+] [-] dminor|15 years ago|reply
Is it illegal? If not, why should Bing stop doing it?
> I think Bing’s engineers deserve to know that when they beat Google on a query, it’s due entirely to their hard work. Unless Microsoft changes its practices, there will always be a question mark.
Kind of rings hollow to me. If I were Bing, I'd want to do what's best for my users.
[+] [-] lallysingh|15 years ago|reply
Both are map makers in a sense, providing guides to what they didn't create, but still spent plenty of effort to make that guide. In the literal, map-making world, the artifact is the printed map. In that world, the way to check for copyright infringement is to see if mapping errors are duplicated as well.
The presumption being that if the errors were copied, then so was the good data.
Short-term, the benefit was that users would get discounts on high-quality data, as they only had to pay for the efforts of the map-copy, not the original map data acquisition. Of course, then you're just waiting for the quality to drop, as there's less and less incentive to actually do the map-making work. The margins go down, and the original sources have to update their maps less often to keep their costs low enough to be competitive.
There isn't a printed page with web search; the product is the output of a continuously-running dataset & algorithm.
But, I'm gonna ask, in the web-search world, how do you define copying and how do you test for it? If you don't think there is a valid definition, please don't count yourself the same as the group who thinks that there is a valid definition and this isn't it. They're two separate things.
(I've framed the question how I see it, and I work for Google, but I'm obviously no official speaker -- I've only been here a few months, and don't work in search quality. This is (almost definitionally) a fanboi war of sorts, and I wanted to stay out. I probably should have :( )
[+] [-] spaghetti|15 years ago|reply