Is pseudonymisation really a new thing? We do this with prod data for our dev and staging databases: a subset of a dump is processed, and names, emails, and other PII are replaced with random strings, etc.
Not only does it make the data handling safe and anonymised, it also helps you avoid crazy stuff like mistakenly sending a batch email to prod users while you're testing things in dev or staging (been there, done that).
I found it clever when I first saw we were doing it, but it seemed simple enough that I just assumed every company did it.
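The dump-scrubbing step described above can be sketched in a few lines. This is a toy sketch, not anyone's actual pipeline: the column names and the `example.invalid` domain are assumptions, and a real job would run over the dump before it ever reaches dev. Keeping emails email-shaped (on a non-routable domain) is what prevents the accidental-batch-email scenario.

```python
import random
import string

# Columns treated as PII in this hypothetical schema.
PII_COLUMNS = {"name", "email"}

def random_token(n=12):
    """Random lowercase string used as a replacement value."""
    return "".join(random.choices(string.ascii_lowercase, k=n))

def scrub_row(row):
    """Return a copy of the row with PII fields replaced.

    Emails keep an email-like shape so application code that
    validates addresses in dev still behaves sanely, and any
    accidental batch send goes to a non-routable domain.
    """
    clean = dict(row)
    for col in PII_COLUMNS & clean.keys():
        if col == "email":
            clean[col] = f"{random_token(8)}@example.invalid"
        else:
            clean[col] = random_token()
    return clean

rows = [{"id": 1, "name": "Alice", "email": "alice@corp.com", "plan": "pro"}]
scrubbed = [scrub_row(r) for r in rows]
```

Non-PII columns (`id`, `plan` here) pass through untouched, which is what keeps the scrubbed subset useful for testing.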
> Is pseudonymisation really a new thing? We do this with prod data for our dev and staging databases: a subset of a dump is processed, and names, emails, and other PII are replaced with random strings, etc.
No, it has been done for years with varying effectiveness, and in my experience it is almost always done badly (randomizing only the identifiers, not all of the data), unfortunately.
Note that even a small subset of personal data can be enough to trace it back to a person, so either randomize everything, or don't bother and put other safeguards in place instead: limiting access, handing out very small, rotating subsets randomized across the whole dataset, and restricting access to test engineers only (which you already described).
I worked at a medium-sized insurance company for a while, and the data available to me was available only to me (all engineers had different, very small, need-to-know subsets). There were very stringent protections in place: local-only access (no internet on the machine), no laptops (so you could only work on location), and NEVER-SHARE-OR-ELSE legal paperwork. This was due to the nature of the work, which could hardly be done without real-world data since it covered fraud detection (on which I won't elaborate further). Engineers who didn't need the real data got randomized databases that were generated, not extracted.
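The generated-not-extracted approach mentioned at the end can be illustrated with a trivial generator. Everything below is invented for the sketch (field names, value ranges, the name lists); the point is simply that no value is derived from production data, so nothing can leak from it.

```python
import random

FIRST = ["alex", "sam", "kim", "noor", "lee"]
LAST = ["smith", "jensen", "garcia", "okafor", "novak"]

def generate_customer(i):
    """Generate a customer record from scratch -- nothing here is
    derived from production data, so nothing can leak from it."""
    first, last = random.choice(FIRST), random.choice(LAST)
    return {
        "id": i,
        "name": f"{first.title()} {last.title()}",
        "email": f"{first}.{last}{i}@example.invalid",
        "premium_eur": round(random.uniform(20.0, 500.0), 2),
    }

db = [generate_customer(i) for i in range(1, 1001)]
```

A real generator would also mimic the statistical shape of production (value distributions, referential integrity across tables), which is where most of the effort goes.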
It's not as simple as that. What's the k-anonymity of your datasets? I'd be mightily impressed if anyone at your company had ever calculated it, yet 'pseudonymisation' is meaningless without a quantitative assessment of its results.
From what I've seen in research, when you get an 'anonymized' dataset from e.g. a government institute, hospital or school, someone will have replaced the names in the 'Name' column with 'Subject 1', 'Subject 2' and so on, and if you're lucky they'll have removed the DoB column. How many people are there in the average organization who are even qualified to have an opinion on whether a dataset is sufficiently anonymous for a certain purpose?
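The k-anonymity measure mentioned above is cheap to compute once you pick the quasi-identifier columns: it is the size of the smallest group of records that share the same quasi-identifier values. A minimal sketch (column names are made up):

```python
from collections import Counter

def k_anonymity(rows, quasi_identifiers):
    """Smallest equivalence-class size over the quasi-identifier
    columns: every record is indistinguishable from at least k-1
    others on those columns."""
    counts = Counter(tuple(r[q] for q in quasi_identifiers) for r in rows)
    return min(counts.values())

rows = [
    {"zip": "1011", "yob": 1980, "diagnosis": "flu"},
    {"zip": "1011", "yob": 1980, "diagnosis": "asthma"},
    {"zip": "1012", "yob": 1975, "diagnosis": "flu"},
]
print(k_anonymity(rows, ["zip", "yob"]))  # -> 1: the 1012/1975 row is unique
```

A result of k=1 means at least one record is unique on the quasi-identifiers and therefore trivially re-identifiable, no matter what was done to the 'Name' column.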
The first few years of GDPR lawsuits are going to be about obvious things; hopefully we'll get to see a few more interesting ones about stuff like this once the basics are settled :)
I think it is worth noting that pseudonymisation is not just some big loophole in the GDPR: pseudonymised data can still be considered personal data and fall within the GDPR's scope.
Pseudonymisation != anonymisation, and as the Article 29 Working Party concluded [0], it may sometimes not be sufficient to protect users' privacy.
> The result is a new set of data that contains no personal information, but retains the format and statistics of the original. The only way that each field in the new data set can be returned to its old state is by applying the key used to generate the hash.
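The keyed-hash scheme quoted above can be sketched with Python's stdlib `hmac`. One caveat: a keyed hash cannot literally be "decrypted" with the key; holding the key lets you recompute tokens from known originals (or maintain a key-protected mapping table), which is how re-linking works in practice. The key value and field here are invented for illustration.

```python
import hmac
import hashlib

def pseudonymise(value, key, length=16):
    """Deterministic keyed hash: the same input always maps to the
    same token, so joins across tables keep working, but without
    the key the mapping cannot be recomputed or brute-checked."""
    digest = hmac.new(key, value.encode(), hashlib.sha256).hexdigest()
    return digest[:length]

# Hypothetical key, held by the accounts team; dev teams see only tokens.
key = b"accounts-team-secret"

token = pseudonymise("alice@corp.com", key)
same = pseudonymise("alice@corp.com", key)
assert token == same  # deterministic: referential integrity is preserved
```

Determinism is the point: the token can serve as a join key across the pseudonymised tables, while anyone without the key sees only opaque strings.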
> these keys are held by the accounts teams. The development teams working on the pseudonymous data never see them
Right ... but I would feel better if I could supply this hash/key back to them. I understand I can request erasure, but I would like the option to request "hashed" (or some user-friendlier term) when I want to keep my data on their server but control it myself.
You supply this key? So you are implying that you store the key and the company doesn't have access to it, hence doesn't have access to your data?
Then is there any reason for the company to store your data?
What if you lose the key? You can't restore your account? I'd be interested to know the % of users losing their password and needing the "forgot password" feature. How would that work with a key you own?
That's a very interesting idea. How would you implement it? I guess that it's one of those things that won't fly without industry acceptance, but a proof of concept would be very useful, I think...
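A proof of concept for the user-held-key idea might look like the sketch below (every name in it is hypothetical). The server stores only keyed-hash tokens of the PII; the user keeps the key and proves ownership by re-deriving the same tokens. The lost-key question above answers itself here: losing the key makes re-identification impossible, which is effectively erasure (sometimes called crypto-shredding).

```python
import hmac
import hashlib
import secrets

def make_user_key():
    """The user generates and keeps this key; the server never stores it."""
    return secrets.token_bytes(32)

def store_pseudonymously(record, user_key):
    """Server side: keep only keyed-hash tokens of the PII fields.
    Without the user's key the server cannot re-identify the record."""
    def token(v):
        return hmac.new(user_key, v.encode(), hashlib.sha256).hexdigest()
    return {"name": token(record["name"]), "email": token(record["email"])}

def verify_ownership(stored, claimed_record, user_key):
    """The user proves a stored record is theirs by re-deriving the
    same tokens with their key -- no PII ever crosses back."""
    return store_pseudonymously(claimed_record, user_key) == stored

user_key = make_user_key()
stored = store_pseudonymously({"name": "Alice", "email": "a@b.c"}, user_key)
assert verify_ownership(stored, {"name": "Alice", "email": "a@b.c"}, user_key)
```

Whether a company would accept a record it can never re-identify is the open question raised upthread: if the server can't use the data without the user's key, there may be little reason to store it at all.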
[0] http://ec.europa.eu/justice/article-29/documentation/opinion...