pinneycolton | 7 years ago

I work with this type of data and I assure you that the results are quite plausible. The original hypothesis was tested against US census data. See "Experiment B" here:

https://dataprivacylab.org/projects/identifiability/paper1.p...

I'll add that there are far fewer live births per day in the US than there are zip codes. I agree that some highly populated areas are problematic, but they may be the only reason that the 87.1% number isn't 100%!

nkurz|7 years ago

It may have gotten lost in the revisions of my comment, but I was consciously not trying to argue that Sweeney's numbers were wrong, only that Cook's explanation was lacking since it doesn't discuss distribution. I hadn't yet looked at the paper.

That said, looking at the paper you linked now, I don't see how Cook's simulation (simulation, not explanation) and Sweeney's paper can both be correct. Cook got 84-85% identifiable assuming uniform age distribution and identical population per zipcode. At the bottom of Figure 14, Sweeney says 87% for the US as a whole.

Shouldn't any non-uniformity (of zip code population or age clustering) act only to reduce the percent of the population that is identifiable? That is, shouldn't Cook's simulation with flat age distribution and equal zipcode populations be an upper bound on identifiability? Since Cook's simulation code looks fine, this makes me suspect that there's something off about Sweeney's analysis.
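For reference, here is a minimal Python sketch of the kind of simulation being discussed: every zip code gets the same population, and each person is dropped uniformly at random into one of the (birth date, sex) bins. This is not Cook's actual code. The function name, the 20 zip codes, the 1,000 residents each, and the 10,000 bins are all illustrative assumptions chosen to keep the run fast.

```python
import random
from collections import Counter


def simulate_fraction_unique(zip_populations, n_bins, seed=0):
    """Monte Carlo estimate of the share of people who are alone in their
    (birth date, sex) bin within their own zip code.

    Assumes every bin is equally likely (flat age and sex distribution).
    """
    rng = random.Random(seed)
    unique = 0
    total = 0
    for population in zip_populations:
        # Assign each resident to a random bin and count bin occupancy.
        occupancy = Counter(rng.randrange(n_bins) for _ in range(population))
        # A resident counts as identifiable only if their bin holds exactly one person.
        unique += sum(1 for count in occupancy.values() if count == 1)
        total += population
    return unique / total if total else 0.0


# Hypothetical toy inputs, not the figures Cook or Sweeney used.
equal_zips = [1000] * 20
print(simulate_fraction_unique(equal_zips, n_bins=10_000))
```

Under these flat assumptions the estimate should land near (1 - 1/n_bins)^(population - 1), which gives a quick sanity check on the simulation.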

Is the 87% perhaps an average of the state percentages, and not properly weighted by state population? Or maybe an average across age classes not weighted by population of that class? Oh, I don't know about those, but maybe I see a bigger issue now...

In Section 4.3.1, Sweeney defines the "Number of subjects uniquely identified in a subdivision of a geographical area". But this isn't a simulation like Cook did, she's just using a binary yes/no depending on whether the subpopulation in each age class exceeds a numerical threshold:

  if population(zi, a) ≥ |Qa|, then ID_aZi = population(zi, a)
  else ID_aZi = 0.

While it's nice that it's clearly defined, I don't think this yields a "percent identifiable" that matches up with Cook's simulation, nor with any common usage of the term. Also (while I'm being picky) isn't the definition backward? If we were to go with this arbitrary definition, wouldn't we want Id_zi to equal zero when the population is less than the threshold? I presume the direction of equality is just a typo, but if the paper is using a hard threshold rather than some more rigorous approach like Cook's simulation, this seems like a major flaw in interpretability of the results.
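To make the all-or-nothing behavior concrete, here is a rough Python sketch of a hard-threshold count as quoted above. It is only an illustration of the rule as transcribed in this comment, not the paper's code. The function name, the dictionary layout, and the sample numbers are assumptions, and since the direction of the comparison is in question, it is exposed as a parameter.

```python
def threshold_identified(populations, bin_counts, as_quoted=True):
    """Count people labeled identifiable by a hard population-vs-bins threshold.

    populations: {(zip_code, age_class): number of residents}
    bin_counts:  {age_class: |Qa|, the number of distinct value combinations}
    as_quoted:   True uses the comparison exactly as quoted (population >= |Qa|);
                 False flips it (population <= |Qa|).
    """
    identified = 0
    for (_, age_class), population in populations.items():
        threshold = bin_counts[age_class]
        if as_quoted:
            passes = population >= threshold
        else:
            passes = population <= threshold
        # All-or-nothing: the whole group counts, or none of it does.
        identified += population if passes else 0
    return identified


# Hypothetical data: two zip codes, one age class with 730 possible bins.
sample = {("zipA", 30): 50, ("zipB", 30): 1000}
bins = {30: 730}
print(threshold_identified(sample, bins, as_quoted=True))
print(threshold_identified(sample, bins, as_quoted=False))
```

Either way, each (zip, age class) group contributes its entire population or nothing, which is the property that makes the result hard to compare with a per-person simulation.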

nkurz|7 years ago

Now that the edit window is over, I finally noticed the massive error in my wording in the second-to-last sentence. Instead, let's pretend I wrote "why would we want Id_zi to equal zero when the population is less than the threshold?" It would also be good to note again that I haven't read the paper closely, and may well be misinterpreting what it is doing.

---

But since I'm still in this edit window, I'll add an update here. I downloaded the per-zip-code population data from here: https://blog.splitwise.com/2013/09/18/the-2010-us-census-pop.... Then I wrote a quick Perl program (parallel to Cook's Python simulation) that uses the actual per-zip-code populations rather than a fixed average. After confirming Cook's 84% number with the fixed population, I ran it on the actual populations (but still with a flat distribution for age and sex) and got 63% unique.
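The effect of unequal zip code populations can also be seen without simulating at all. The Python sketch below is a closed-form stand-in, not the Perl program mentioned above (which isn't shown here). With a flat distribution over n_bins bins, a person in a zip code of n residents is unique with probability (1 - 1/n_bins)^(n - 1). The function name and the toy population lists are assumptions for illustration, not census data.

```python
def expected_fraction_unique(zip_populations, n_bins):
    """Expected share of people alone in their (birth date, sex) bin,
    assuming every bin is equally likely within each zip code."""
    q = 1.0 - 1.0 / n_bins  # chance that one other resident misses a given bin
    total = sum(zip_populations)
    # A resident is unique when all n - 1 neighbors land in other bins.
    expected_unique = sum(n * q ** (n - 1) for n in zip_populations if n > 0)
    return expected_unique / total if total else 0.0


# Hypothetical toy populations with the same total (20,000 people).
equal = [5000] * 4
skewed = [500, 500, 1000, 18000]
print(expected_fraction_unique(equal, n_bins=50_000))
print(expected_fraction_unique(skewed, n_bins=50_000))
```

In this toy case the skewed list yields a lower expected fraction than the equal one, because most people live in the one crowded zip code where collisions are common.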

Presumably this number would drop somewhat further with actual age distributions, but I don't know exactly how far. My current belief is that Sweeney's paper does a good job of calling attention to the fact that the risk of identification is high, but the methodology and exact numbers should not be trusted. The actual percentage of Americans identifiable by (zip, dob, sex) is large, but something less than 63%. It might be interesting to run the simulation with actual age bracket data, but I didn't find this in any easy-to-download format, so I think I'll stop here.