Studying how Firefox can collect additional data in a privacy-preserving way

278 points| GrayShade | 8 years ago |groups.google.com | reply

428 comments

[+] kannanvijayan|8 years ago|reply
I can do a quick summary of what's being proposed and why. I work in the JS team at Mozilla and deal directly with the problems caused by insufficient data. Please note that I'm speaking for myself here, and not on behalf of Mozilla as a whole.

Tracking down regressions, crashes, and perf issues is painful without good telemetry about how often they happen and in what context. Issues that might otherwise have taken a few days to resolve with good info become multi-week efforts to reproduce the issue with little information.

It simply boils down to the fact that we can't build a better browser without good information on how it's behaving in the wild.

That's the pain point anyway. Mozilla's general mission, however, makes it very difficult to collect detailed data - user privacy is paramount. So we have two major issues that conflict: the need to get better information about how the product is serving users, and the need for users to be secure in their browsing habits.

We also know from history that benevolent intent is not that significant. Organizations change, and intents change, and data that's collected now with good intent can be used with bad intent in the future. So we need to be careful about whatever compromise we choose, to ensure that a change of intent in the future doesn't compromise our original guarantees to the user.

This is a proposed compromise that is being floated. Don't collect URLs, but only top-level+1 domains (e.g. images.google.com), and associate information with that. That lets us know broadly what sites we are seeing problems on, hopefully without compromising the user's privacy too much. Also, the information associated with the site is performance data: the time spent by the longest garbage-collection, paint janks.

This is a difficult compromise to make, which is why I assume it took so long for Mozilla to come around to proposing this. These public outreaches are almost always the last stage of a lengthy internal discussion on whether proposals fit within our mission or not.

I'm not directly involved in this proposal, but I personally think it's necessary, and strikes a reasonable balance between the privacy-for-users and actionable-information-for-developers requirements.

[+] stewbrew|8 years ago|reply
> Tracking down regressions, crashes, and perf issues without good telemetry about how often it's happening and in what context.

If that's what you're aiming at, collect the data but keep it local. Install some sort of responsiveness/"problem" monitoring. Ask the user to send data relevant to the problem if a problem occurs. IMHO there is no need to systematically collect user data for that.

Or get the data from a random sample of users. You don't need data from everyone.

[+] keithpeter|8 years ago|reply
"This is a difficult compromise to make"

Then don't make the compromise.

As others have expressed here the reason few people opt in to data collection may be because they have chosen to use a Web browser that does not mandate the collection of data.

I'm assuming there will always be an opt out which I shall add to my list of things I have to do when installing Firefox.

[+] thanksgiving|8 years ago|reply
> I'm not directly involved in this proposal, but I personally think it's necessary, and strikes a reasonable balance between the privacy-for-users and actionable-information-for-developers requirements.

I use Firefox and always opt into any telemetry that sends data back to Mozilla. You could say I am a fanboy. I think it is a HORRIBLE idea and Mozilla should scrap it yesterday and never bring it up again. If people bring it up again, send them to the roof team (if it doesn't exist, create one). If they come downstairs, fire them. You already have people like me who are willing to opt-in to every single thing you can try. For example, Firefox nightly on Android has consistently crashed for me about every five minutes or so since the last weekend and yet I keep using it. Don't throw away this goodwill.

[+] mirimir|8 years ago|reply
This is a horrible development. If Mozilla starts collecting this sort of data on an opt-out basis, it will put many users at risk. Seriously, WTF?

> This is a proposed compromise that is being floated. Don't collect URLs, but only top-level+1 domains (e.g. images.google.com), and associate information with that. That lets us know broadly what sites we are seeing problems on, hopefully without compromising the user's privacy too much.

Sure, there's no problem with images.google.com because it's generically innocuous. But what about pornhub.com for users in Saudi Arabia? Or some Japanese site that's essentially child porn for users in the US? The top-level+1 domain in many cases is totally incriminating.

> Also, the information associated with the site is performance data: the time spent by the longest garbage-collection, paint janks.

Maybe so. But it's collection of the top-level+1 domain that's the problem.

> I'm not directly involved in this proposal, but I personally think it's necessary, and strikes a reasonable balance between the privacy-for-users and actionable-information-for-developers requirements.

Fine. But then, make it opt-in, to protect users.

[+] pishpash|8 years ago|reply
Many problems here:

1. You're proposing a mechanism for collecting data, and a strategy for extracting more data than you currently do. You have not figured out the type of data that you will finally need, only a set of things that you currently envision. Naturally, the data that you will collect in the future will be more than what you currently envision. There is built-in mission creep that is dangerous.

2. What you currently envision is not fleshed out as especially useful. You only believe it is useful. The pain point of biased data is a red herring. Your concern is more about not enough data.

3. You have found a technology which you believe will allow you to collect a lot of data anonymously. But none of you seem to understand the technology very well. It seems like a shiny toy that you are eager to go to town with. I am not sure this is the right attitude.

4. You're proposing to use your users in lieu of proper testers, or to save time. There are many ways to properly test software and to save time. Have they been explored? There used to be a time when beta software was a thing. Prompt the users to become testers for your beta software. If users don't want to be testers then don't collect data from them. How much data do you actually need anyway? Have you fully utilized your existing data?

Over all, I see this as a nice-to-have luxury, not some life-and-death situation, and subverting the goodwill of users is not worth it, IMHO.

[+] joosters|8 years ago|reply
If it's so harmless, let users opt-in. Adding data collection via an opt-out is shameful, it shows that you know people would not want this and yet you'd prefer to get more data anyway.
[+] beachwood23|8 years ago|reply
Thanks for your input. Glad to hear someone from the Mozilla team on this thread.

It's an interesting compromise... because without improved performance and features, we'll lose Firefox entirely, and all of the relative privacy / security gains that entails. This is a good example where "perfect" privacy that reaches only a few is the enemy of "good" privacy that reaches more people.

[+] belorn|8 years ago|reply
If user privacy is paramount, then there are multiple ways to lower the privacy incursion that is caused by the data collection.

Only collect top-level domains of Alexa rank 1k. That users are using a highway is less sensitive than that they are using a specific street where only 5 homes exist, and it reassures users that private domain names won't be leaked.

Send the data through Tor. That way you only get the data about the browser <-> site interaction, not user<->browser<->site interaction.

And make it opt-in and notify users of the purpose of the data collection. A good model to follow here is Debian installer and popcon. Follow the good practices of data collection in the free software world and do not use dark patterns.

[+] yuhong|8 years ago|reply
Mozilla's crash reporter already has the option of submitting the URL.
[+] AdmiralAsshat|8 years ago|reply
Top-level domains are still betraying the user's privacy. Does it bug me that PornoTube is significantly laggier on Firefox than YouTube? Sure. Do I want Mozilla to know that I'm visiting it? Hell no.
[+] noir_lord|8 years ago|reply
No compromise, I switched to FF on Android to avoid this crap from Chrome and now you'll do it as well.

I look forwards to the fork.

[+] z3t4|8 years ago|reply
Take a list of sites, for example the Alexa top 10,000, and make an automated script that browses these sites and collects whatever information you need. Have a bunch of devices (phones, laptops, PCs from different brands) doing this. This will not cost much and you don't have to spy on your users.
[+] gorhill|8 years ago|reply
> Don't collect URLs, but only top-level+1 domains (e.g. images.google.com), and associate information with that.

Using "images.google.com" as an example is too convenient.

That would be great if you could also add whatever TLD+1 most people would rather keep private as another example right after "images.google.com".

[+] SilasX|8 years ago|reply
>This is a proposed compromise that is being floated. Don't collect URLs, but only top-level+1 domains (e.g. images.google.com), and associate information with that.

Until sites start programmatically generating a unique subdomain for each [Firefox] user.

[+] clarkevans|8 years ago|reply
> Don't collect URLs, but only top-level+1 domains (e.g. images.google.com)

Do you consider images.google.com to be eTLD+1? The eTLD would be .com; so, eTLD+1 would be google.com; and hence, images.google.com would be eTLD+2?

eTLD: https://en.wikipedia.org/wiki/Public_Suffix_List
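
The eTLD+1 distinction above can be illustrated with a minimal sketch. It assumes a tiny hard-coded suffix set (`PUBLIC_SUFFIXES`, a hypothetical stand-in for the real Public Suffix List, which has thousands of entries plus wildcard and exception rules that this sketch ignores), and the function name `etld_plus_one` is made up for illustration:

```python
from typing import Optional

# Hypothetical, tiny subset of the Public Suffix List, for illustration only.
PUBLIC_SUFFIXES = {"com", "org", "uk", "co.uk"}


def etld_plus_one(hostname: str) -> Optional[str]:
    """Return the registrable domain (eTLD+1) of hostname, or None if it has none."""
    labels = hostname.lower().rstrip(".").split(".")
    # Walk from the longest candidate suffix to the shortest, so that
    # "co.uk" is matched before "uk".
    for i in range(len(labels)):
        candidate = ".".join(labels[i:])
        if candidate in PUBLIC_SUFFIXES:
            if i == 0:
                return None  # the hostname is itself a public suffix
            # Keep exactly one label in front of the matched suffix.
            return ".".join(labels[i - 1:])
    return None  # no known suffix (e.g. "localhost")


print(etld_plus_one("images.google.com"))  # google.com
print(etld_plus_one("www.bbc.co.uk"))      # bbc.co.uk
```

Under this reading, images.google.com reduces to google.com, so the "images" label would not be part of what is reported.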

[+] syshum|8 years ago|reply
>>This is a difficult compromise to make,

Sorry I do not accept this compromise. Mozilla seems to have lost its way of late. Sad to see a company that was at the forefront of privacy and security abandon that in the name of market share and performance.

I would rather sacrifice performance for privacy, not the other way around.

From EME, to the adoption of Browser Extensions as the only customization option, now this.... Mozilla and FF are changing in ways that are harmful to the open, secure, and private web. Following the trends and policies of MS and Google is not the correct path.

[+] feelin_googley|8 years ago|reply
".. we can't build a better browser without good information on how it's behaving in the wild."

Who decides what is a "better" browser?

1. Is it the authors? Do they write the software for themselves and agree to share it for free with anyone who may want to use it?

2. Is it the users? Do the authors solicit feedback from users to determine what users want? If users demanded a browser with no default telemetry, would the authors comply?

3. Is it third parties who have an interest in the behavior of users? For example, domain name industry, ad-supported businesses, their employees or advertisers themselves. Are the authors on salary, compensated indirectly from advertising revenue? Or does it come from somewhere else?

4. Is it all of the above? If we follow the money where does it lead? Whose decision of what is "better" is the most important?

Mozilla is descended from a defunct 1990's company that aimed to license a web browser to corporations for a fee. It would have been very clear in that case who the browser was being written for. But today, it is not so clear who Mozilla is serving. It resembles some sort of "multi-stakeholder" project.

It would be nice to have a browser that fits description 1 or 2. I believe there are plenty of folks, including some developers, who would appreciate a browser with no default telemetry. By virtue of the total absence of data collection, they might consider it "better" than alternative browsers that "need telemetry" for whatever reason.

[+] im3w1l|8 years ago|reply
There are many very, very political people inside Mozilla. Some of them may even want to commit political violence. Political violence seems to be a problem that just grows and grows, so how can we be sure that it's not supported in Mozilla? These would be a very small minority of Mozilla of course, but the problem is that you don't know who it is. And it only takes a single extremist to betray your users. To get your users injured or even killed.

The same concern will of course apply to any other data harvesters, but that's for another thread

[+] disconnected|8 years ago|reply
Ok, I get your point. You need the extra debugging information.

Now, here's my concern. I DO NOT want compromises. I DO NOT want to balance anything. I DO NOT want this telemetry crud on my browser spewing out my browsing history to anyone, no matter how anonymous you people claim it will be.

I just want a decent web browser.

What are my options? "Mozilla's way or the highway"? Redirect evil.telemetry.things.mozilla.org to /dev/null? Go back to elinks?

Or will there be a "disable this piece of crap utterly and completely" button somewhere not hidden under an URL? Or even better, a compile flag?

Edit: spelling...

[+] msla|8 years ago|reply
You wouldn't say anything else, so your statements don't change anything: Any company which wants to collect more data would justify it in the same way.

The main reason to collect data is monetization. People don't like to think they're being sold, so it's justified on other grounds. That's a universal. Since the way data is monetized is to track and segregate users, claims that it can be done in a privacy-respecting fashion are, therefore, specious.

There is one conclusion to be drawn here, and it isn't that Mozilla is going to respect my privacy.

[+] frankmcsherry|8 years ago|reply
As someone familiar with differential privacy, and (somewhat less) with privacy generally, here are some suggestions for Mozilla:

1. Run an opt-out SHIELD study to answer the question: "how many people can find an 'opt-out' button?". That's all. You launch this at people with as much notice as you would plan on doing for RAPPOR, and see if you get a 100% response rate. If you do not, then 100% - whatever you get are going to be collateral damage should you launch DP as opt-out, and you need to own up to saying "well !@#$ them".

2. Implement RAPPOR and then do it OPT-IN. Run three levels of telemetry: (i) default: none, (ii) opt-in: RAPPOR, (iii) opt-in: full reports. Make people want to contribute, rather than trying to yank what they (quite clearly) feel is theirs to keep. Explain how their contribution helps, and that opting-in could be a great non-financial way to contribute. If you give a shit about privacy, work the carrot rather than the stick.

3. Name some technical experts you have consulted. Like, on anything about DP. The tweet stream your intern sent out had several historical and technical errors, and it would scare the shit out of me if they were the one doing this.

4. Name the lifetime epsilon you are considering. If it is 0.1, put in plain language that failing to opt out could disadvantage anyone by 10% on any future transaction in their life.
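
The "10%" in point 4 matches the standard differential privacy bound: with privacy parameter epsilon, the probability of any outcome can change by a factor of at most e^epsilon depending on whether one person's data is included. A minimal sketch of that arithmetic (the function names are made up for illustration):

```python
import math


def max_probability_ratio(epsilon: float) -> float:
    """Worst-case factor by which any outcome's probability can change."""
    return math.exp(epsilon)


def max_relative_increase(epsilon: float) -> float:
    """The same bound expressed as a relative increase (0.10 means 10%)."""
    return math.exp(epsilon) - 1.0


# For epsilon = 0.1 the factor is about 1.105, i.e. roughly a 10% change.
print(round(max_probability_ratio(0.1), 3))
print(round(max_relative_increase(0.1), 3))
```

For small epsilon the relative increase is close to epsilon itself, which is why 0.1 reads as "about 10%"; the bound grows quickly for larger values (epsilon = 1 already allows a factor of about 2.7).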

I think the better experiment that is going on here is the trial run of "we would like to take advantage of privacy tech, but we don't know how". I think there are a lot of people who might like to help you on that (not me), and I hope you have learned about how to do it better.

[+] embik|8 years ago|reply
This is ridiculous. I use and recommend Firefox for pure ideological reasons, because frankly, Chrome/Chromium is miles ahead of them.

If they start opt-out tracking using the same approach as Google I do not see any reason to use it nor install it for my friends and family. That's some data for you, Mozilla.

[+] huhtenberg|8 years ago|reply
The single largest advantage of Firefox over other browsers is that despite all odds and occasional missteps they managed to respect users' desire for complete privacy.

> For Firefox we want to better understand how people use our product to improve their experience.

Sure thing. But the fact that they are unhappy that some (many?) people are opting-out from the data collection is merely a sign that they don't want to understand why people are using Firefox in the first place. By opting out from the data collection people effectively tell them over and over again that they don't want for Mozilla "to understand how they use Firefox" or "to improve their experience", not at the expense of their privacy.

No phoning home. No telemetry, no data collection. No "light" version of the same, no "privacy-respecting" what-have-you. No means No. Nada. Zilch. Try and shovel any of that down people's throats and the idea of Firefox as a user's browser will die.

[+] gbuk2013|8 years ago|reply
> No phoning home. No telemetry, no data collection. No "light" version of the same, no "privacy-respecting" what-have-you. No means No. Nada. Zilch. Try and shovel any of that down people's throats and the idea of Firefox as a user's browser will die.

https://github.com/mozilla/addons-frontend/issues/2785

And now this :-(

I have been using Firefox since before it was called that. I develop my apps in it, even though most of my colleagues have switched to Chrome years ago. Even though it is (or was for a while) slower than Chrome for things like Canvas.

But I use it because I believe in Free Software. But Mozilla keeps disappointing. DRM, bundled 3rd-party apps, analytics, tracking... It is just so very sad. :-(

Also, I have 17 add-ons installed (11 active). At present, of these 17, only 2 will continue working after November when the switch to WebExtensions is enforced.

Where to go from here?

[+] Ajedi32|8 years ago|reply
I'm not really sure what your concern is here. Let's assume for a moment that Firefox's implementation of differential privacy in this scenario is completely correct, and that as a result it's completely impossible (even in an information-theoretic sense) to learn anything about any individual user using this data; only about many users in aggregate.

In this scenario, how exactly would Firefox's actions here compromise anyone's privacy?

[+] gvx|8 years ago|reply
> Currently we can collect this data when the user opts in, but we don't have a way to collect unbiased data, without explicit consent (opt-out).

That to me suggests the problem isn't that too many people are opting-out, it's that not enough people are opting-in.

[+] kogepathic|8 years ago|reply
> What we plan to do now is run an opt-out SHIELD study [6] to validate our implementation of RAPPOR.

IMHO, this is a bad idea. Many people I know already use Firefox because they're wary of giving Google (Chrome) all their data.

Firefox should make this feature opt-in only.

[+] cJ0th|8 years ago|reply
While I do understand the allure of collecting this kind of data I find it highly disturbing to see this from Mozilla.

I think not having perfect information about the users is a trade-off that should be made in order to stay an alternative to most other browsers. There are still ways to get more data by other means, though. When it comes to most visited websites, for instance, the Alexa ranking should give a good, if not perfect, idea.

[+] stutonk|8 years ago|reply
Just want to add a little volume to the general opinion here that collecting user data, no matter how anonymous, is a terrible idea for a product whose only appealing quality is that it respects its users' privacy.

Data is both highly alluring and addictive as evinced here by Mozilla potentially willing to shoot itself in the foot to get some. What's to keep this from becoming a frog in a boiling water kind of situation? How can I trust that Mozilla is going to adhere to their own stated standards? The easiest answer is that I won't have to because I can just use something else. Personally, the only reason I use Firefox is because it's slightly less convenient to set up a security-patched version of Chromium.

Other people in this thread have made the excellent point that too few people opting in to data collection is in itself a critical piece of data. Moreover, questions such as "Which top sites are users visiting?" can be answered by looking at data from page ranking services, and then they can go to those sites on their own testing equipment to answer their other questions. A little investment in acquiring this data without spying, and maybe getting a wider array of testing equipment, is probably less costly than the potential loss of the market share that they're already struggling to hold.

[+] dagenleg|8 years ago|reply
In the end Mozilla is simply going to go through with it and there's nothing we can do about it. Just like with the killing of the XUL plugins - the company simply didn't care about the outcry. I mean why would they? The number of people who care about stuff like 'customization' or 'privacy' is small.

So we will toothlessly complain but then the changes will be shoved down our throats, because obviously why would one care what the non-targeted demographics whine about. And of course it will be framed as being 'for our own good' and half of the people complaining will just deal with it, just like the majority already does.

[+] dhimes|8 years ago|reply
I generally trust Mozilla, but I really don't understand what they are going to get out of the data. Their explanation leaves me scratching my head. Perhaps it's simply because I don't work on browsers?

How does seeing which sites users use that need Flash drive their decision-making? Either they support Flash, or they don't.

And- ditto for "Jank" (not sure I understand that term, frankly- why is it capitalized?). Some developers don't optimize well- how is Mozilla going to use this? I think they do a good job over on MDN.

I guess I'd like to be sure I understand what problem they are trying to solve. Maybe they feel like without understanding their users they can't keep up with Chrome. I see people talking about how good Chrome is. And I must admit- it is sweet for me too. But that may be because (1) I don't have it loaded up with add-ons like I do Mozilla and (2) they have optimized for certain sites like youtube and gmail and I just can't get Firefox to work all that well on those sites.

But I'm not convinced that they need my data to fix that.

EDIT: On the other hand, Chrome seems to lose my passwords on every upgrade so it won't be my main browser until it fixes that little issue, which is going on, what, 5 years now?

[+] froydnj|8 years ago|reply
(Disclaimer: I work for Mozilla.)

"Jank" is our internal term for slow, non-responsive interaction with the browser (the capitalization of it in the original message is a little peculiar). If you click your mouse button, and then a second or more later, the item that you were clicking on the screen responds? That's jank. That input form that's not keeping up with your typing? That's jank. And so on.

We can (and do) collect statistics on how much jank people are experiencing, and we can look for ways to improve those statistics, but knowing what particular sites (not complete URLs, just eTLD+1 sites) jank occurs on is much more actionable. Browser developers can go visit particular sites to experience and analyze the jank for themselves, or we can see what janky sites are particularly popular in a given region and focus our efforts on improving those sites--either by doing things more efficiently in the browser, or reaching out to the site developers and asking them to consider changing things to make their site work better in Firefox. (Complete URLs would be even more actionable, but we don't want to collect your complete browser history.)

The argument for Flash is similar: we can get aggregate usage numbers for Flash, and perhaps see how that correlates to jankiness (or crashiness, or what have you), but having some information on what sites are using Flash makes the data even more actionable, for similar reasons as those given above.

[+] damnfine|8 years ago|reply
I say it over and over. You can not completely anonymize data with any reliability. Please note the qualifier, many systems work for many vectors, but any sufficiently large dataset can be used to graph habits and correlate them. Maybe there is a safe way, but I put the onus of proving it on the person implementing it.
[+] digitalzombie|8 years ago|reply
> You can not completely anonymize data with any reliability.

Well... there's actually a field for that. I forgot what they call that field because of how niche it is but my friend at google is doing just that.

He said there are mathematical theorems that prove the data is sufficiently anonymized.

He gave the example of the Netflix competition, where researchers were able to deanonymize the data Netflix gave them. And his job was to prevent that at Google.

I can see why if you're trying to sell users data while maintaining privacy.

[+] js8|8 years ago|reply
I liked Firefox for years. I have lived through years of shenanigans such as broken extensions, forgetting what tabs I had open because Firefox accidentally closed without restoring them, moving icons and menus around for no reason, and recently, an update on my Ubuntu that broke scrolling of pages (with PgUp/PgDown). And now this..

I am starting to think that they just don't want people to use Firefox.

Yeah, I know it's free software, so I have no right to complain. I just wonder why?

[+] tunap|8 years ago|reply
Where governments and corporations are concerned, the "why" condenses down to two simple answers: commercialization (profit) or weaponization (control)... it is easily conceivable that both will result over time. I hope Tor & EFF start giving more love to Pale Moon & its ilk, but that may just be mitigating the inevitable death by 1000 cuts to privacy.
[+] bugmen0t|8 years ago|reply
The linked paper to RAPPOR is really, really noteworthy here.

In essence, Firefox will ask itself whether it visited website X and flip a coin. If it's heads, it will lie to the server and send a random boolean. If it's tails, it will answer truthfully. This way there is no way for anyone (including Mozilla) to know whether you actually visited the website. But the statistics will work out such that the collective data from everyone will give a good representation of all users. I find this a neat technology to collect data in a privacy-preserving way. And there's an opt-out (opt-in won't work because it creates bias and provides messy results).
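
The coin-flip scheme described here is classic randomized response, and a minimal sketch shows why the aggregate still works out. This is not RAPPOR itself (which adds Bloom filters and two levels of randomization) and not Mozilla's implementation; the function names and the fair-coin probabilities are assumptions taken from the description above:

```python
import random


def randomized_response(truth: bool, rng: random.Random) -> bool:
    """Report one user's answer with plausible deniability."""
    if rng.random() < 0.5:
        # Heads: ignore the truth and send a uniformly random boolean.
        return rng.random() < 0.5
    # Tails: send the true answer.
    return truth


def estimate_true_fraction(reports) -> float:
    """Recover the population-level fraction of true answers.

    With fair coins, P(report is True) = 0.5 * p + 0.25, where p is the
    real fraction, so p = 2 * observed - 0.5.
    """
    observed = sum(reports) / len(reports)
    return 2.0 * observed - 0.5


rng = random.Random(42)
# Simulated population: 30% of users really visited the site.
truths = [i < 30_000 for i in range(100_000)]
reports = [randomized_response(t, rng) for t in truths]
print(estimate_true_fraction(reports))  # close to 0.3
```

Any single report is consistent with either true answer, while the estimate over many users converges on the real fraction. The cost is noise: the fewer users report, the less accurate the estimate.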

I really, honestly don't understand why people are so upset.

[+] norea-armozel|8 years ago|reply
I'm not sure why Mozilla needs to track what sites I'm going to but if they add tracking into their browser then I'm just going to have to find another browser or at least put together a build of Firefox without the tracking. It's not so much that I have anything to hide but the fact that I'm not interested in being their product. If they can't remember that they're a nonprofit that's supposed to make a FOSS-based browser which doesn't spy on people and works well with web standards then they just need to shut down. I know that's extreme but I'm just frustrated with the further corporatization of the Internet even on the margins like Firefox. Everything just has to be a product or a way to commodify the use thereof.
[+] unethical_ban|8 years ago|reply
I am ashamed of the general "sky is falling" tone in this thread. I'm a privacy advocate. I know I'm not a fan of submitting my browser history (even domain-only) to another organization. Mozilla has always been the most privacy- and user-focused browser, and I think that history should be taken into consideration before the sky falls.

People are insulting the developers, saying Chinese owned, VPN-operating Opera would be better for privacy... there is a lot of nonsense here.

IMO this is not the most needed feature, and I would be happy for Firefox to keep in mind its reputation as a product focused on user privacy.

[+] yjftsjthsd-h|8 years ago|reply
This might not be so bad as I expected from the title, but implementation details will really matter. If, for instance, they collect exact homepage URLs, they cannot make it anonymous (some sites include usernames as URL components).
[+] yakult|8 years ago|reply
1. Any data collection at all deanonymizes the user, cf panopticlick.

2. Frankly even opt-out is not acceptable. I can't recommend any software that periodically asks users for data access, since there exist non-technical users who have a nonzero chance of clicking yes to everything. If they are related to me in some way this compromises my privacy also.

[+] darrmit|8 years ago|reply
I still use Firefox specifically because of Chrome's privacy concerns and was under the impression after dropping FirefoxOS Mozilla was headed in the right direction.

It seems they've convinced themselves that the only way to improve the product is to collect data on their users, rather than continuing to push the idea of privacy - which, in my opinion, if marketed correctly, could win over a lot of users. The browser is still fundamentally awesome.

This seems like the kind of thing they could push through their TestPilot program and just market it, rather than pushing it to everyone by default. But I imagine they want to push it to everyone specifically so they can take advantage of those who are unaware of the ability to opt out.

[+] MichaelMoser123|8 years ago|reply
I guess any browser wants to dominate the platform. It turns into another IE once it succeeds at doing so. Here comes the new boss, same as the old boss.
[+] sp332|8 years ago|reply
It seems your premise is wrong since Firefox's market share has been steadily declining for years. Privacy apparently doesn't matter to that many people.
[+] Multicomp|8 years ago|reply
Yeah, if you could keep your hands off from collecting my data without my consent, that would be great.

Otherwise I might as well just use Chrome. Hopefully some PR guy will pour some water on this before it turns into a dumpster fire.

[+] codedokode|8 years ago|reply
I don't really understand why it is necessary. Can't they just take the top 100 sites from a rating like Alexa? And if they want to evaluate the performance, they could buy a cheap Celeron or Atom-based laptop with Windows and browse those top 100 sites. I am sure that this will give more information than any statistics.