This[0] video from Apple's WWDC gives a nice overview of how Differential Privacy is being used in iOS. Basically, Apple can collect and store its users’ data in a format that lets it glean useful info about what people do, say, like and want. But it can't extract anything about a single specific one of those people that might represent a privacy violation. And neither can hackers or intelligence agencies.
It's cool that they're using DP for some analytics. But it's not quite the holy grail Apple and its fans have been selling it as: any analytics campaign using DP will eventually either average out to pure noise or end up being non-anonymous.
Differential privacy is cool. However, I looked at Google's RAPPOR algorithm (deployed in Chrome, and clearly designed with real-world considerations in mind) in some depth, and I found that RAPPOR needs millions to billions of measurements to become useful, even while exposing users to potentially serious security risks (epsilon = ln(3), so "bad things become at most 3x more likely"). Much better than doing nothing, but we'll continue to need non-cryptographic solutions (NDAs, etc.) for many cases.
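For intuition, here's a minimal randomized-response sketch at epsilon = ln(3). This is my own toy illustration of the mechanism family RAPPOR builds on, not Google's actual code:

```python
import math
import random

def randomized_response(truth: bool, epsilon: float = math.log(3)) -> bool:
    """Report the true bit with probability e^eps / (e^eps + 1);
    otherwise report its negation. At eps = ln(3) that probability
    is 3/4, so an adversary seeing a single report learns at most
    a 3x shift in likelihood about the true value."""
    p_truth = math.exp(epsilon) / (math.exp(epsilon) + 1.0)
    return truth if random.random() < p_truth else not truth
```

Since the ratio of report probabilities is p/(1-p) = e^epsilon = 3, any single answer stays deniable; only aggregates over many users become accurate, which is exactly why so many measurements are needed.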
Aaron Roth was my professor at Penn. He's definitely the expert on differential privacy. Fun fact: his dad won the Nobel Prize in Economics a few years ago.
Take GPS data, for example: NYC has released a taxicab dataset showing the "anonymized" location of every pickup and dropoff.
This is bad for privacy. One attack is that now if you know when and where someone got in a cab (perhaps because you were with them when they got in), you can find out if they were telling the truth to you about where they were going -- if there are no hits in the dataset showing a trip from the starting location that you know to the ending location that they claimed, then they didn't go where they said they did.
Differential privacy researchers claim to help fix these problems by making the data less granular, so that you can't unmask specific riders: blurring the datapoints so that each location is at a city block's resolution, say. But that doesn't help in this case -- if no one near the starting location you know went to the claimed destination, blurring doesn't fix the information leak. You didn't need to unmask a specific rider to disprove a claim about the destination of a trip.
I think that flaws like these mean that we should just say that GPS trip data is "un-de-identifiable". I suspect the same is true for all sorts of other data. For example, Y chromosomes are inherited the same way that surnames often are, meaning that you can make a good guess at the surname of a given "deidentified" DNA sequence, and thus unmask its owner from a candidate pool, given a genetic ancestry database of the type that companies are rapidly building.
The attack you suggest is ruled out by differential privacy. The precise guarantee is a bit complicated. The first thing to note is that the output of a differentially private mechanism must be random. Then, the guarantee is that Pr[output] does not change by very much whether or not you are included in the dataset. In other words, even if you were omitted from the dataset, the chance that the algorithm produces the same result is essentially unchanged.
This definition rules out the attack you suggest. In particular, if you are removed from the dataset, then the probability of the output (i.e., a ride starts in the region) goes from very large to very small. Therefore, the algorithm you describe (i.e., adding noise to the start location) is not actually differentially private.
The confusion arises because oftentimes adding noise is sufficient. For example, the average of n real numbers in [0,1] changes by at most 1/(n-1) if you delete one point from the dataset. Therefore, you can just add a little bit of noise and the released average becomes differentially private.
For the dataset you describe, a sibling comment proposed the correct mechanism -- you have to add noise to the count returned by the query, not the start location. (Technically I think you could just add noise to the start location like you propose, but the amount of noise would have to be large enough that all the start locations overlap by a sufficient amount.)
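As a toy illustration of the noisy-average point above (a sketch assuming values in [0,1]; the function names are mine, and this is not production DP code):

```python
import random

def laplace_noise(scale: float) -> float:
    # The difference of two i.i.d. exponentials is Laplace-distributed.
    return random.expovariate(1.0 / scale) - random.expovariate(1.0 / scale)

def dp_mean(values, epsilon: float = 1.0) -> float:
    """Noisy mean of values assumed to lie in [0, 1]. Deleting one
    point moves the true mean by at most 1/(n-1) (the sensitivity),
    so Laplace noise of scale sensitivity/epsilon is enough to mask
    any individual's presence or absence."""
    n = len(values)
    sensitivity = 1.0 / (n - 1)
    return sum(values) / n + laplace_noise(sensitivity / epsilon)
```

With many points the sensitivity (and hence the noise) is tiny, which is why averages are the easy case, while releasing raw locations is not.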
It's been a couple years since I read the literature so I might be wrong, but iirc differential privacy fuzzes the number of matches instead of the data points themselves.
That is, in your example differential privacy would precisely display the start and end points but the number of riders would be fuzzed (perhaps showing 4 departures and 2 arrivals at the respective location/time).
Also, iirc, offline differential privacy can outright remove data points, while online systems can block sufficiently deanonymizing queries.
One of the criticisms of differential privacy is that it can render the data useless... I definitely found that held true for my data. In the end, my company simply decided against releasing (or collecting) any customer data.
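The fuzzed-count idea a couple of paragraphs up can be sketched like this (hypothetical trip records and field names; a toy Laplace count, not any deployed system):

```python
import random

def laplace_noise(scale: float) -> float:
    # The difference of two i.i.d. exponentials is Laplace-distributed.
    return random.expovariate(1.0 / scale) - random.expovariate(1.0 / scale)

def dp_count(records, predicate, epsilon: float = 0.5) -> int:
    """Noisy count query. Adding or removing one record changes the
    true count by at most 1, so Laplace(1/epsilon) noise suffices;
    the result is rounded and clamped at zero for display."""
    true_count = sum(1 for r in records if predicate(r))
    return max(0, round(true_count + laplace_noise(1.0 / epsilon)))
```

So a query like "departures from this block at 9am" might report 4 when the true count is 3, and an answer of 0 no longer proves that no such trip happened -- which is what defeats the absence-of-a-match attack described earlier.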
You seem to have an incorrect version of differential privacy in mind. "Blurring the datapoints" as you describe would not satisfy differential privacy at all.
DP would not allow an information leak of this kind, unless the data set was modeled in a very silly way.
> if there are no hits in the dataset showing a trip from the starting location that you know to the ending location that they claimed, then they didn't go where they said they did.
As others have pointed out, differential privacy is not obtained just by blurring the points. But I'd like to point out that what you wrote above is closely related to the original definition of DP.
DP guarantees that you cannot detect a statistically significant difference between the output on a dataset and the output on the same dataset with one element (e.g., one trip in your case) removed. That implies the difference is also insignificant if you add a point.
In other words DP addresses exactly your concern here.
At one point, I knew someone who wanted to give money to a large medical organization so that they could show their patients the tradeoff between various interventions (efficacy vs. side-effects).
The money was going to be donated to build an app that would belong to the institution.
The institution would not let their own researchers publish the data on the app even though it was anonymous. They didn't want to take the risk.
It would be great if this led to accepted protocols that made it so that people didn't have to think about it -- "Oh yeah, we'll share it using DP" -- and then people could move ahead using data.
Of course personally identifiable information will be extracted despite this model. "Differential Privacy" is cynical academic malpractice -- selling a reputation so that when individuals are harmed in the course of commercial exploitation of the purportedly anonymized data, the organizations that profited can avoid being held responsible.
We never learn, because there is money to be made if we pretend that anonymization works.
To be clear, I think you're right that no tracking is better than trying to protect data.
However, it's important to understand that 'anonymization' is very different than the practice of "Differential Privacy."
I'm no expert but here is how I understand it as a simplified example:
Imagine your information is stored in a spreadsheet. It is storing your weight, height, zipcode, age and name.
The 'anonymized' spreadsheet would still have a unique row dedicated to you; it may just replace your name with an ID number or an encrypted string. But, just like in the AOL data leak, information stored as a single line item is still easy to trace back: there is likely no one else with your weight, height, and age combination in your zipcode. So a hacker can identify a single person.
Differential Privacy would store information differently, perhaps in separate spreadsheets: one that is a list of heights, one that is a list of weights, etc. No two spreadsheets would store the information in the same order (#3 on the height list would not be #3 on the weight list), and it may even contain some incorrect dummy information.
There would be some sort of algorithmic relation, however, that allows a system to create outputs in which the data has meaningful information (trends, means, standard deviations, etc.) but cannot be back-tracked to identify any single unique row.
Differential Privacy allows us to see the trend "Males age 45 are taller on average than Females age 45" but not say "User #155083 is age 45, weighs 195lbs, and lives in zipcode 10001"
That's a big difference in privacy, and while it isn't perfect, it is a step in the right direction. While I wish more companies would adopt a no-data policy, it is at least better that they are as responsible as they can be with the data they have.
I'm confused -- you sound against the idea of differential privacy, even though the foundation of differential privacy is that anonymization DOESN'T work to protect people's privacy. In fact, the canonical example of failed anonymization used in the differential privacy field is the AOL fiasco.
I did a lot of work in healthcare IT and have also obsessed about voter privacy (protecting the secret ballot).
I've seen de-anonymization in practice. I also know there's a huge chasm between best available science (and practices) vs the real world. (Dumb example: mis-redacting PDFs with opaque boxes instead of removing the text.)
#1 At best, like crypto, differential privacy may offer temporary protection, to data re-used (shared, in transit), if given assumptions are preserved.
#2 Also, like crypto, I have no confidence that anyone, anywhere will implement DP correctly, or even be able to prove they've done it correctly.
#3 The original data is stored somewhere. There is no DP story for mitigating leaks.
Given my disappointment, I believe (but cannot prove) there are two strategies worth exploring.
First, contracts that use-case-box and time-box the data, stating how and when shared data may be used, with a drop-dead time when shared data must be destroyed. Part of this contract could be expanded to include parameters for differential privacy. One org I work with has these policies. Alas, Reagan's "trust, but verify" is tough. We add fake data (honeypot-esque) and have caught cheaters.
Second, I'm keen to further explore translucent databases, where data in situ (at rest) is encrypted.
Lastly, I'm always looking to see who is working in this space, and what they're doing. I'd like to believe that someone will crack this nut.
That data (the AOL search data) wasn't intentionally released, so while pseudonyms had been used, no serious effort had been made to fully (or more fully) anonymize or protect users. Here the intent is to produce data sets where individual identification is statistically unlikely if not impossible (by fuzzing the data), or where individuals can refute the data because there's a statistical chance the data is a lie (probability biased in favor of truth so that the aggregate data is still useful).
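The "statistical chance the data is a lie" mechanism described above is randomized response. A sketch of both sides of it -- the deniable per-user report and the aggregate recovery that keeps the data useful (function names are mine, the lie probability is an illustrative choice):

```python
import random

def noisy_answer(truth: bool, p: float = 0.75) -> bool:
    """Tell the truth with probability p > 1/2, lie otherwise, so
    any individual record can be plausibly denied."""
    return truth if random.random() < p else not truth

def estimate_true_rate(reports, p: float = 0.75) -> float:
    """Recover the population rate q from noisy reports using
    E[observed] = p*q + (1-p)*(1-q), i.e. q = (obs - (1-p)) / (2p - 1)."""
    obs = sum(reports) / len(reports)
    return (obs - (1.0 - p)) / (2.0 * p - 1.0)
```

Because the bias is known, it can be inverted over a large population, while no single report proves anything about its author.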
[0] https://developer.apple.com/videos/play/wwdc2016/709/?time=8... (the "Transcript" tab has the text of the video if you want to read instead of watch.)
devsquid | 8 years ago:
Here's a great interview with the MS researcher who invented the technique: http://www.sciencefriday.com/segments/crowdsourcing-data-whi...
One of the quotes I always liked from it is "any overly accurate estimates of too many statistics is blatantly non-private"
AdamSC1 | 8 years ago:
You do lose out on a lot of human bias in the research process, but you also create blind errors that are hard to validate.
I know in my work there are plenty of times I run an analysis and go back and manually check some entries as a sanity check - pros and cons here!
rectang | 8 years ago:
So long as signal remains in the collective data, as the state of the art advances individual data streams will inevitably emerge.