
Fedora considers “privacy-preserving” telemetry

135 points | pabs3 | 2 years ago | lwn.net

215 comments

[+] dschuetz|2 years ago|reply
It does not really matter anymore what bulk usage data collection is called, or whether it is "privacy-preserving".

Looking at the current developments in AI, I am concerned that AI models can easily de-anonymize and guess end point users when being fed with "telemetry data" of hundreds of thousands of clients.

I hear and read a lot of software and hardware vendors saying that "telemetry" is supposed to somehow magically improve the user experience in the long run, but in actuality software tends to get worse: more unstable and less useful.

So, I would like to know how exactly any telemetry data from Fedora Linux clients is going to help them, or how it is going to improve anything.

[+] techwizrd|2 years ago|reply
I disagree wholeheartedly. Privacy-preserving technologies, including privacy-preserving AI (e.g., federated learning, homomorphic encryption) and privacy-preserving data linkage/fusion, are really important. They're crucial in my day-to-day work in aviation safety, for example.

And telemetry is important. We have limited resources. How do we determine the number of users impacted by a bug or security vulnerability? Do we have a bug in our updater or localization? Are we maintaining code paths that aren't actually used? Telemetry doesn't magically improve user experience, but I'd rather make decisions based on real data rather than based on the squeakiest wheel in the bug tracker.

We can certainly make flawed decisions based on data, but I'd argue that we're more likely to make flawed decisions with no data.

[+] marginalia_nu|2 years ago|reply
You really don't need AI to do this. Collect enough data points and you can fingerprint basically anyone using very old fashioned techniques. AI doesn't really bring anything new to the table.
[+] MauranKilom|2 years ago|reply
"Only 5% of users use this feature, so we will remove it to save development efforts."

As seen in Firefox..

[+] bo1024|2 years ago|reply
Differential privacy techniques are provably impossible to de-anonymize if implemented correctly. Correct implementation is possible, but fraught with opportunities for error or manipulation.
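The guarantee being described can be made concrete with randomized response, the simplest differentially private mechanism. A minimal sketch in Python (function names and the truth probability are illustrative, not from any Fedora proposal):

```python
import random

def randomized_response(uses_feature: bool, p_truth: float = 0.75) -> bool:
    """Report the truth with probability p_truth, otherwise a fair coin flip.
    A single report is plausibly deniable: it never proves the true answer."""
    if random.random() < p_truth:
        return uses_feature
    return random.random() < 0.5

def estimate_true_rate(reports, p_truth: float = 0.75) -> float:
    """Invert the noise over many reports: observed = p*true + (1-p)/2."""
    observed = sum(reports) / len(reports)
    return (observed - (1 - p_truth) / 2) / p_truth
```

Aggregates stay accurate while individual reports stay deniable; the catch is that the guarantee evaporates if the noise is misimplemented or the same question is asked repeatedly, which is exactly the "fraught with error" part.
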
[+] JohnFen|2 years ago|reply
> I am concerned that AI models can easily de-anonymize and guess end point users when being fed with "telemetry data" of hundreds of thousands of clients.

You don't need AI for this. This is done by real humans right now, using data points correlated from multiple sources.

[+] barbariangrunge|2 years ago|reply
People keep saying "you don't need ai for this." Sure. But to do it at scale, and to intelligently connect disparate kinds of data contextually?

That's time consuming and expensive without ai, so you can't do it at scale to a comprehensive degree. That hasn't been practical until now. It still isn't quite cost effective to do this for every human, everywhere, but soon it will be. Give it 5-10 years

Thanks to ai

[+] didntcheck|2 years ago|reply
That's definitely a legitimate fear, as seen with the AOL controversy [1], but if they're just collecting aggregate statistics it's much less of a risk. I.e.

  User ANON-123 with default font x and locale y and screen resolution z installed package x1
Is clearly a big hazard, but aggregate statistics on which fonts, locales, and resolutions are in use are not really. Even combinations to answer questions like "what screen resolutions and fonts are most used in $locale?" should be safe as long as the entropy is kept low. It is less useful, since you have to decide on your queries a priori rather than being able to run arbitrary queries on historical data, but ethics and safety > convenience

[1] https://en.wikipedia.org/wiki/AOL_search_log_release
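
The distinction drawn above can be sketched as an aggregation rule: publish only per-attribute counts and suppress values seen fewer than k times. A hypothetical illustration (the threshold and field names are made up):

```python
from collections import Counter

def safe_counts(records, key, k_min=20):
    """Aggregate a single attribute across clients and drop rare values.
    Rare combinations are exactly what makes AOL-style re-identification work."""
    counts = Counter(r[key] for r in records)
    return {value: n for value, n in counts.items() if n >= k_min}
```

A per-user record like ANON-123's never exists on the server side under this scheme; the trade-off, as noted, is that the queries must be chosen up front.
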

[+] agloe_dreams|2 years ago|reply
> Looking at the current developments in AI, I am concerned that AI models can easily de-anonymize and guess end point users when being fed with "telemetry data" of hundreds of thousands of clients.

I can almost guarantee you that the US government has a tool where you can input a few posts from a person on an anonymous network and get back all of their public profiles elsewhere. Fingerprinting tools beat all forms of VPNs and the like. Our privacy and anonymity died like maybe two years ago, there is no stopping it.

[+] musicale|2 years ago|reply
> So, I would like to know how exactly any telemetry data from Fedora Linux clients is going to help them, or how is it going to improve anything.

It won't improve anything for users. It might improve something for IBM.

[+] seri4l|2 years ago|reply
Instead of collecting more data, why not do something with the data we already have? A quick look at the Fedora Bugzilla or the GNOME GitLab issues tab suggests the bottleneck doesn't lie in data collection, but in processing.
[+] Aachen|2 years ago|reply
Apples and oranges. Bug reports are filed by a specific type of user and don't give a comprehensive view of all bugs. Statistics can also cover a lot more than bugs: "is the number of MIPS users proportional to the amount of extra effort we need to put in to make that happen?" is not a data point you'll find in Bugzilla or other tickets.
[+] nixpulvis|2 years ago|reply
Because management can impose these new data collection policies more easily than fixing known issues. It then gives them the potential to find new, easier work for the engineers to implement, making it seem like they are being effective. Meanwhile, it can be unclear how these metrics relate to overall software quality.

Some metrics like startup time and crash counts lead to clear improvement, while others like pointer heatmaps and even more invasive focus tracking are highly dubious in my opinion.

On a related note, I’m coming to the opinion that A/B testing is harder to pull off than many think. And serving a single user both A and B at any point can confuse them and get in the way of their trusting the consistency of the software. Much like how when you search for something twice and get different results in Apple Maps. OK, now I’m just ranting…

[+] hedora|2 years ago|reply
They moved to the CADT model twenty years ago, so the bug reports will never be read.

Now, with telemetry, they can say quantifiable things like "we've driven catastrophic root filesystem loss and permanent loss of network connectivity to 0% of installs!", and prioritize any contrary bug reports away in a data-driven, quantifiable way.

(Because, of course, weak telemetry signals are more valuable than actual humans taking the time to give you feedback on your product.)

[+] AshamedCaptain|2 years ago|reply
Because they will claim that bugzilla is only used by "advanced users" that are not representative of the average user of Fedora.

I absolutely detest that Catch-22 argument, which some distro (not Fedora) actually tried to use on me in the past.

[+] akikoo|2 years ago|reply
> the bottleneck doesn't lie in data collection, but in processing

I created a bug report [1] for tigervnc-server in Fedora because the Fedora documentation [2] for setting up a VNC server no longer matched what was coming from dnf.

In the bug report I provided the info that would need to be fixed in the documentation. Now, after two months, seemingly nothing has been done to fix the situation.

[1] https://bugzilla.redhat.com/show_bug.cgi?id=2193384

[2] https://docs.fedoraproject.org/en-US/fedora/latest/system-ad...

[+] nullc|2 years ago|reply
If anything, there often appears to be a negative correlation between increased data collection and product quality, in my experience.

I figure it must be due to an abdication of responsibility-- absent information, the product must at least appeal to someone working on it who is making decisions about what is good and what isn't, and so it will also appeal to people who share their preferences. But with the power of DATA we can design products for the 'average user' which can be a product that appeals to no single person at all!

Imagine that you were making shirts. To try to appeal to the greatest number of people, you make a shirt sized for the average person. But if the distribution of sizes is multimodal or skewed, the mean may be a size that fits few or even no one at all. You would have done better picking a random person from the factory and making shirts that fit them.
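
The shirt example is easy to make concrete: with a bimodal size distribution, the mean lands in the empty middle (the numbers below are invented for illustration).

```python
# Fifty people wear size 36 and fifty wear size 44; the mean is 40,
# a size that fits nobody, while copying one random person fits half.
sizes = [36] * 50 + [44] * 50
mean_size = sum(sizes) / len(sizes)                             # 40.0
fits_mean = sum(1 for s in sizes if abs(s - mean_size) <= 1)    # 0 people
fits_random_person = sum(1 for s in sizes if abs(s - 36) <= 1)  # 50 people
```
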

When your problem has many dimensions like system functionality, the number of ways you can target an average but then fit no one as a result increases exponentially.

Pre-corporatized open source usually worked like fitting the random factory worker: developers made software that worked for them. It might not be great for everyone, but it was great for people with similar preferences. If it didn't fit you well you could use a different piece of software.

In corporatized open source, huge amounts of funding go into particular solutions, and they end up tightly integrated. Support for alternatives is defunded (or just eclipsed by better-funded but soulless rivals). You might not want to use GNOME, but if you use KDE, you may find Fedora's display subsystem crashes out any time you let your monitor go to sleep, or find yourself unable to configure your network interfaces (to cite some real examples of problems my friends have experienced)-- you end up stuck spending your life essentially creating your own distribution, rather than saving the time that you hoped to save by running one made by someone else.

Of course, people doing product design aren't idiots, and some places make an effort to capture multimodality through things like targeting "personas"-- which are inevitably stereotyped, patronizing, and overly simplified (like assuming a teacher can't learn to use a command prompt or a bug tracker). Or through user studies, but these are almost always done with very unrepresentative users-- people with nothing better to do than get paid $50 to try something out-- so you learn only about the experience of people with no experience and no real commitment or purpose to their usage (driving you to make an obscenely dumbed-down product). ... Or by things like telemetry, which even at their best will fail to capture things like "I might not use the feature often, but it's a huge deal in the rare events I need it", or get distorted by the preferences of day-0 users, some large percentage of whom will decide the whole thing isn't for them no matter what you do.

So why would non-idiots do things that don't have good results? As sibling posts note, people are responding to the incentives in their organizations which favor a lot of wheel spinning on stuff that produces interesting reports. People wisely apply their efforts towards their incentives-- their definition of a good result doesn't need to have much relation to any external definition of good.

[+] dschuetz|2 years ago|reply
Reading this: https://www.dedoimedo.com/computers/telemetry-should-you-all... I've stumbled on this:

> Optional telemetry, of course, but again - this creates a selective and unpredictable reality. Ordinary people don't care either way, and nerds will always make deliberate choices that often have nothing to do with product or profit or anything else entirely.

So, by that logic, users who opt out of telemetry are aware of why they are doing it. People who don't care share their chaotic usage patterns. This creates a false picture of usage reality, and makes software worse. In conclusion, there are only two remaining choices: 1. Make telemetry non-optional 2. Ditch telemetry and rely on QA studies

[+] JohnFen|2 years ago|reply
> 1. Make telemetry non-optional

But then users who care about this issue will just block the application's ability to phone home, or use a different one that doesn't spy (my definition of spying is any time data about me or my machines is collected without my informed consent).

[+] denton-scratch|2 years ago|reply
> So, by that logic, users who opt out of telemetry are aware of why they are doing it.

Ambiguous: they're aware of why who is doing it? I opt out of telemetry because I don't know why they're doing it - the data collectors. I mean, I know why they say they're doing it, but I don't know if it's true.

I also don't want my computing resources and network bandwidth used to further a goal that I might not support. Even if the only reason for collecting data really is to "improve the product", perhaps that'll result in them making the product dependent on systemd, which from my POV would be an adverse outcome.

[+] 0xbadc0de5|2 years ago|reply
As someone who's used Fedora for 20 years (since before it existed as such, i.e. RH6), this decision would be a show stopper. No, Fedora, you're doing just fine without telemetry! If you try to force it, you'll lose a ton of loyal users. Find out whoever is pushing this idiocy and fire them, quick.
[+] registeredcorn|2 years ago|reply
Any suggestions for an alternative to switch to?
[+] W4RH4WK55|2 years ago|reply
Do we have any indication that telemetry leads to an actual improvement of the software's overall quality at this point?

It seems to me that even with excessive levels of telemetry, software remains buggy and sluggish most of the time.

[+] hedora|2 years ago|reply
I think telemetry collection is a symptom of deeper organizational issues.

For instance, I've never worked with a competent release manager who said "we need more field telemetry!"

Instead, the good ones invariably want improved data mining of the bugtracker, and want to increase the percentage of regression bugs that are caught in automated testing. They also generally want to increase the percentage of automated test failures that are root-caused.

[+] 2OEH8eoCRo0|2 years ago|reply
https://discussion.fedoraproject.org/t/f40-change-request-pr...

> I can speak as a GNOME developer—though not on behalf of the GNOME project as a community—and say: GNOME has not been “fine” without telemetry. It’s really, really hard to get actionable feedback out of users, especially in the free and open source software community, because the typical feedback is either “don’t change anything ever” or it comes with strings attached. Figuring out how people use the system, and integrate that information in the design, development, and testing loop is extremely hard without metrics of some form. Even understanding whether or not a class of optimisations can be enabled without breaking the machines of a certain amount of users is basically impossible: you can’t do a user survey for that.

[+] worble|2 years ago|reply
No, but it does allow them to make lots of pretty graphs they can show to upper management in those long multi-hour meetings.
[+] rcxdude|2 years ago|reply
The problem being that, inevitably, the 'improvement' gets measured through the same telemetry figures that are being optimised, so of course it's perceived by developers as helping them improve things.
[+] romanovcode|2 years ago|reply
To add to the point: Alphabet probably has more data than any other company (except FB, I suppose), and they still can't release a good product that people will actually use, no matter how much data they have.
[+] tokai|2 years ago|reply
Annoying. I have used Fedora for many years now and hadn't planned to stop. Even with the slim chance that they don't go through with this, it tells me that they have lost their bearings. Oh well, it will be nostalgic doing a bit of distro hopping again.
[+] doodlesdev|2 years ago|reply
Where are you planning to distro hop to? I find Fedora has a special balance between bleeding edge and stable that no other distribution I tried achieves.
[+] lakomen|2 years ago|reply
IBM strikes again.

The recent closing of RHEL's sources, now telemetry in Fedora. At least it doesn't have ads like Ubuntu, amirite...

Spyware everywhere.

Not that I'd ever use Fedora as my main Desktop OS. Arch has won that battle. And if I want a simple installer, where everything just works, Manjaro.

[+] heywoodlh|2 years ago|reply
> if I want a simple installer where everything just works

I actually feel like the archinstall[0] tool included in the official Arch ISOs really nails easy installation. It's an official way to install Arch that is incredibly user friendly and fast, in my opinion.

[0] https://wiki.archlinux.org/title/Archinstall

[+] rwmj|2 years ago|reply
It's so tedious every time people ascribe some action of Fedora or Red Hat to IBM. This is nothing to do with IBM. Red Hat can and does make these decisions all by itself.
[+] kleinsch|2 years ago|reply
Would make more sense to link to the actual proposal rather than a two sentence summary

https://discussion.fedoraproject.org/t/f40-change-request-pr...

[+] counternotions|2 years ago|reply
> One of the main goals of metrics collection is to analyze whether Red Hat is achieving its goal to make Fedora Workstation the premier developer platform for cloud software development. Accordingly, we want to know things like which IDEs are most popular among our users, and which runtimes are used to create containers using Toolbx.
[+] Aachen|2 years ago|reply
Nobody is reading even the summary anyway. They saw the word telemetry in the headline and had an opinion.
[+] bogwog|2 years ago|reply
I find it hard not to be cynical about this after everything IBM has been doing. I wonder if this is an attempt to identify exactly which organizations are using Fedora in order to upsell or do another CentOS-like rug pull?

Luckily, OpenSuse Tumbleweed looks to be a pretty good alternative to Fedora. There’s even an immutable version of it, like Silverblue!

[+] RcouF1uZ4gsC|2 years ago|reply
> We believe an open source community can ethically collect limited aggregate data on how its software is used

For me the big question is why? Proprietary software needs telemetry because the user is not in control and features are only added by the owner of the software, thus the centralized owner needs to know what features to add.

Open source is different. It is decentralized. Anybody can tweak the system to make it better for themselves. In addition, as opposed to proprietary software which sells licenses, the most common open source monetary model seems to be selling support in which case the people buying support can ask for the feature without telemetry.

To put it another way, you need telemetry for a cathedral since decisions are made centrally. A bazaar doesn’t need telemetry, since decisions are decentralized.

[+] jacquesm|2 years ago|reply
WTF is wrong with these companies? Don't they understand that the only reason they are in business is because they do not do such things?
[+] ilc|2 years ago|reply
Ex Red-Hat:

I think the Red Hat ecosystem is turning IBM Blue.

I'll take this as the warning to move off Fedora, to more forward looking distributions.

Red Hat already lost my work laptop.... now it'll lose my personal one :(

[+] justinclift|2 years ago|reply
Reading some of the Fedora discussion thread about this, the attitude of Fedora developers is breathtaking:

https://discussion.fedoraproject.org/t/f40-change-request-pr...

    Opt-in telemetry is garbage. I’m going to stop responding to comments that are
    requesting opt-in because I’ve made my position clear: users who opt-in are not a
    representative sample, and that opt-in data will not be accurate or useful.
Accurately summarised to "Fuck off dickheads, your privacy is getting in the way of us doing development!".
[+] justinclift|2 years ago|reply
The actual proposal:

https://lwn.net/ml/fedora-devel/CAJqbrbeOZrHvYjvMCc=qGZD_VXB...

    === What data might we collect? ===

    We are not proposing to collect any [...] particular metrics
    just yet, because a process for Fedora community approval of
    metrics to be collected does not yet exist. That said, in the
    interests of maximum transparency, we wish to give you an idea
    of what sorts of metrics we might propose to collect in the
    future.

    One of the main goals of metrics collection is to analyze
    whether Red Hat is achieving its goal to make Fedora Workstation
    the premier developer platform for cloud software development.
    Accordingly, we want to know things like which IDEs are most
    popular among our users, and which runtimes are used to create
    containers using Toolbx.

    Metrics can also be used to inform user interface design
    decisions.  For example, we want to collect the clickthrough
    rate of the recommended software banners in GNOME Software to
    assess which banners are actually useful to users. We also want
    to know how frequently panels in gnome-control-center are
    visited to determine which panels could be consolidated or
    removed, because there are other settings we want to add, but
    our usability research indicates that the current high quantity
    of settings panels already makes it difficult for users
    to find commonly-used settings.

    Metrics can help us understand the hardware we should be
    optimizing Fedora for. For example, our boot performance on hard
    drives dropped drastically when systemd-readahead was removed.
    Ubuntu has maintained its own readahead implementation, but
    Fedora does not because we assume that not many users use Fedora
    on hard drives. It would be nice to collect a metric that
    indicates whether primary storage is a solid state drive or a
    hard disk, so we can see actual hard drive usage instead of
    guessing. We would also want to collect hardware information
    that would be useful for collaboration with hardware vendors
    (such as Lenovo), such as laptop model ID.

    Other Fedora teams may have other metrics they wish to collect.
    For example, Fedora localization wishes to count users of
    particular locales to evaluate which locales are in poorer shape
    relative to their usage.

    This is only a small sample of what we might want to know; no
    doubt other community members can think of many more interesting
    data points to collect.
That last piece "no doubt other community members can think of many more interesting data points to collect" sounds pretty bad for telemetry that's enabled by default, with people having to opt out of it. :(
[+] gigel82|2 years ago|reply
There's no such thing as privacy-preserving telemetry. How do you retrieve the telemetry from the device? Via networking, right? BAM, that's an IP address leak which is PII. We don't need to go any further than that.
[+] gavinhoward|2 years ago|reply
I was going to install Fedora on one of my machines soon. Not anymore.
[+] bravetraveler|2 years ago|reply
They need to give this more consideration if their answer is to tie it to gnome-initial-setup... unless I misunderstand, and this is purely a GNOME-cooked thing

I fear what that means for where the preference is saved, and how spins (or users who simply choose not to have GNOME) may feasibly opt out

Where's the demarcation? Is this some dconf thing that a timer will read, a service, or what?

I lack trust in their handling in certain matters. For example, every Fedora device 'phones home' for AP checks:

  $ cat /usr/lib/NetworkManager/conf.d/20-connectivity-fedora.conf
  [connectivity]
  enabled=true
  uri=http://fedoraproject.org/static/hotspot.txt
  response=OK
  interval=300
Including those that are wired... and that's rather unnecessary. I generally get a sense of haste in these decisions, lately.

To disable it, mask the /usr file with one in /etc:

    touch /etc/NetworkManager/conf.d/20-connectivity-fedora.conf
Another example: systemd-oomd on anything with > 64GB installed; entire user scopes randomly killed with oodles free.

I say this in the softest way possible, I don't really mind it... but it raises an eyebrow towards eagerness/attention.

<snide>They already get to know how well they're doing being "the premier OS" by seeing how often they get hit.</snide>

[+] TrueDuality|2 years ago|reply
I see people complaining about instinctive reactions to the usage of the word telemetry, but those reactions are rightly justified. People have them for a very good reason, even with this specific proposal. If you read the discussion post, the following becomes clear:

* The proposer has clearly not done any research on how to actually collect anonymous data (they'd never heard of differential privacy for example).

* They want a plug and play solution (they specifically say they don't want to do more work than that)

* They are not open to discussing privacy regulations such as GDPR

* They are not willing to bend on the most contentious points of their proposal

* The system they want to use collects invasive metrics that can be de-anonymized and has only been used by a niche distribution

Because the de-anonymization bit might not be clear, let me summarize some of the things that the Endless OS metrics collect:

* Country

* Location based on IP address to within 1 degree lat/long

* Your specific hardware profile

* Daily report that includes your hardware profile, along with the number of times the check ins have occurred in the past

* Detailed program usage (every start / stop)

* An unspecified series of additional metrics that can be sent from anywhere else on the system via a dbus interface

Additionally, this proposal wants to explicitly collect:

* What packages and versions of such are installed

* Specific application usage metrics (the example they give is the gnome settings panel)

They discard the IP address, but how hard do you think it is to differentiate users based on the combination of hardware profile, +/- 1 degree of location accuracy, and their specific set of packages (with the history of package installs/uninstalls already known through their package manager)? The proposal doesn't meet its stated intention of being anonymous, and the proposer evidently understands that users don't want this but believes their desire for the metrics overrides the end users' desire not to be tracked.
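
How little it takes to single users out can be sketched with a uniqueness check over quasi-identifiers (the profiles below are invented, not taken from Endless OS data):

```python
from collections import Counter

def uniqueness(profiles):
    """Fraction of profiles whose attribute combination is unique,
    i.e. re-identifiable from the 'anonymous' data alone."""
    counts = Counter(profiles)
    return sum(1 for p in profiles if counts[p] == 1) / len(profiles)

# (hardware model, lat/long rounded to 1 degree, installed package count)
profiles = [
    ("ThinkPad X1", (48, 11), 1412),
    ("ThinkPad X1", (48, 11), 1498),
    ("Custom AMD build", (35, -97), 2203),
    ("Framework 13", (52, 13), 1740),
]
```

Even the two machines sharing a model and location separate on package count, so every profile is unique; dropping the package-count field is what brings the uniqueness back down.
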

[+] motohagiography|2 years ago|reply
This approach is a half-baked idea and an unforced error. If I were contributing to another distro, I would say I was building a general differential privacy and zk-SNARK library and accompanying services stack that developers could use for whatever they found interesting. Then, once it had some burn-in, I'd launch a limited beta where it was trustworthy enough that participants could get useful data from the rest of the clusters without exposing the other participants to risk.

Maybe we need a participatory privacy stack that produces valuable anonymous data and also contributes to it. You might be able to do it with homomorphic arithmetic that increments defined counters (like the hash of a package or version), and we already have distributed ledgers for collecting and distributing the data. We can do queries with differential privacy and zk-SNARKs.

It's not a viable product, because people who actually use data want the real data (that discretion is power to them), but as a tool for coordinating a cooperative effort, we need to build something new to say that this is how we do things now.
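
One way to sketch the "homomorphic counters" idea without full homomorphic encryption is additive secret sharing between two non-colluding aggregators, as in Prio-style systems. A toy illustration only; a real deployment would also need authenticated channels and validation that each client's increment is well-formed:

```python
import random

MOD = 2**31 - 1  # public modulus; all counter arithmetic is done mod MOD

def share(increment: int):
    """Split a client's counter increment into two shares.
    Each share alone is uniformly random and reveals nothing."""
    a = random.randrange(MOD)
    b = (increment - a) % MOD
    return a, b

def combine(shares_a, shares_b):
    """Each aggregator sums the shares it holds; only the recombined
    total equals the sum of all client increments."""
    return (sum(shares_a) + sum(shares_b)) % MOD
```

Neither aggregator ever sees an individual client's value, yet the combined total is exact, which is the property the "defined counters" idea is after.
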