Hey, I'm the author of this blog. Much of my previous deanonymization research has been discussed on HN; see http://www.google.com/search?q=33bits.orgsite:news.ycombinat... Also, if you find the premise of the blog interesting check out the sitemap linked from the page.
But since this post is about the About page, let me share a few lessons I've learned from the blog, which has been more successful in communicating my research than I'd dared to hope for when I started it 3.5 years ago.
1. Those of us working in technical areas often struggle to explain our ideas to others not as technical, in a way that avoids oversimplification and losing essential meaning. Sometimes you'll discover an analogy or metaphor or phrase that does both. Seize those chances; they're powerful.
2. Coming up with a name is more important than you might think. If a good name will make your idea or product even 5% stickier, it follows that it may be worthwhile to spend 5% of your time just coming up with the name. One way to do it is to be constantly on the lookout for a good name while you're working on the product.
3. If you're writing about something that has policy implications, and want it to be read in Washington, it's hard but not impossible. Two important requirements are to network and build up an audience — they aren't going to read your blog just because it ranks high in Google searches — and to use language that non-technical people can understand.
De-identification of medical charts is a bottleneck in clinical research. It's impractical to ask for thousands of consent forms; however, smaller sample sizes are inconclusive, so much so that most of medicine is driven by inconclusive research findings. Moreover, full anonymization makes it impossible to follow patient records over time. This will kill any big patient outcome study, at least financially. What are your thoughts?
Hey! I ran into your blog after I saw an announcement for (one of?) your talk(s) next week.
I submitted the about page because the two key claims that you make, (1) you only need a few bits of information to identify a person uniquely in the whole world, and (2) this information is becoming easier and easier to obtain, both make a lot of sense to me. Your about page does an excellent job of communicating these two points, and I thought it might be interesting food for thought for HN.
I'm wondering whether, much as we'd all like to have privacy and anonymity, these could be goals that are impossible to achieve in the future. I'd like to hear your thoughts on where we as a society are heading in this context, and whether it's unrealistic to expect that conventional expectations of privacy will continue to be fulfilled. Perhaps we should accept that the privacy battle is lost and try other solutions to the problems that privacy was solving?
I've been trying to think of a way of using typing cadence to capture "bits" of information. Do you think you'd have a good solution for that?
Take this scenario (and also check out the disclaimer at the bottom!).
All users on earth type the same paragraph, or perhaps some password (clearly, the longer the text, the more distinct the fingerprint, but bear with me on this).
Based on this sequence of keypresses, I capture the timestamp at which each key is pressed, and the duration each key is held down for.
Based on this information, how would you suggest a person go about extracting a unique fingerprint from these values?
I was thinking the best way would be to treat each keypress as a point in some space and the duration it's held down as a vector. Then, if each user is entering the same paragraph, the distance across all of the vectors could be used to calculate some identifying fingerprint.
Disclaimer: I'm absolutely not interested in the slightest in tracking users. Every weekend I try to research something that interests me. Last weekend it was user fingerprinting based on typing speed and cadence.
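A minimal sketch of the vector idea above, in Python. All names, key events, and timings are hypothetical, and Euclidean distance is just one reasonable choice of metric; real keystroke-dynamics systems use more robust statistics.

```python
import math

def cadence_features(events):
    """Build a feature vector from (key, press_time, release_time) tuples:
    per-key hold durations, then inter-key gaps, in the order typed."""
    holds = [release - press for _, press, release in events]
    gaps = [events[i + 1][1] - events[i][2] for i in range(len(events) - 1)]
    return holds + gaps

def cadence_distance(a, b):
    """Euclidean distance between two feature vectors of equal length
    (everyone types the same reference text, so lengths match)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Two hypothetical typists entering the same three-key sequence.
alice = cadence_features([("t", 0.00, 0.08), ("h", 0.15, 0.22), ("e", 0.30, 0.37)])
bob = cadence_features([("t", 0.00, 0.12), ("h", 0.25, 0.36), ("e", 0.50, 0.61)])

# A fresh sample is attributed to whichever enrolled vector it is closest to.
sample = cadence_features([("t", 0.00, 0.09), ("h", 0.16, 0.23), ("e", 0.31, 0.38)])
closest = min([("alice", alice), ("bob", bob)],
              key=lambda p: cadence_distance(p[1], sample))
```

With these made-up timings the sample lands nearest the first typist; the interesting research question is how stable such vectors are across sessions, keyboards, and moods.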
> If a good name will make your idea or product even 5% stickier, it follows that it may be worthwhile to spend 5% of your time just coming up with the name.
I'm a sucker for a great name, but this is misleading. If something will give you an x% better outcome, it doesn't follow that you should spend about x% of your resources on it.
It depends entirely on the opportunity costs, yeah? You should spend time on your name only when you believe that thinking about names for another hour will do more good than coding or talking to customers for another hour.
> 1. Those of us working in technical areas often struggle to explain our ideas to others not as technical, in a way that avoids oversimplification and losing essential meaning. Sometimes you'll discover an analogy or metaphor or phrase that does both. Seize those chances; they're powerful.
Sorry, does both of what? I don't mean to nitpick, but this is a problem I run into regularly and can't seem to find a reliable approach for, so I'm just trying to break it down into the factors involved...
I understand the log2 concept of being able to narrow someone down via binary search, but I have a question: don't the "facts" about a person have to divide the remaining population in half (or into smaller chunks)?
For instance if you know "Frank" doesn't wear a Rolex, that would not rule out very many people. So statistically, it would probably be better to know if Frank has red hair, as that could rule out a lot more people.
Also, let's say you have it narrowed down to four people, but the last bit of information is common to all of them. You now have to get another bit, and possibly another, correct?
EDIT: I felt like I didn't express my main point well enough: while you can certainly narrow people down with "bits" of information, information is usually not just 1 or 0 and can be too fuzzy (or too common) to be useful in a binary search, although with the right bits it can of course be fruitful.
I'm really interested in this concept and also curious whether anyone is employing it on a mass scale.
"Also, let's say you have it narrowed down to four people, but the last bit of information is common to all of them."
The definition of a bit is something which removes half the possibilities. If you have 4 people and acquire a "bit" of information that breaks them into two categories, one with 4 people and one with 0 people, you, by definition, in fact have 0 bits.
Fractional bits are not only possible, they are by far the common case. With a lg2 in the definition of the bit, it's pretty uncommon to have integral bits.
Critical insight: What we call a "bit" in a computer and a "bit" in information theory are related but not the same thing. You can't have a fraction of a bit stored in your computer's RAM; the words are meaningless. It is best to simply flush your idea of what you think a bit is and start over again from scratch when studying information theory, then when you are comfortable with it the connections will become obvious. Starting from the RAM side is actively harmful.
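The fractional-bits arithmetic in this subthread can be checked directly: observing an attribute matched by a fraction p of the remaining population yields -log2(p) bits. The prevalence figures below (Rolex non-wearers, red hair) are made-up illustrative numbers, not real statistics.

```python
import math

def bits_of_information(p):
    """Bits gained from observing an attribute whose prevalence in the
    remaining population is p: -log2(p)."""
    return -math.log2(p)

# A perfect 50/50 split yields exactly 1 bit.
print(bits_of_information(0.5))  # 1.0

# "Frank doesn't wear a Rolex": nearly everyone matches, so almost no bits.
print(round(bits_of_information(0.95), 3))  # ~0.074

# "Frank has red hair": rare, so it carries several bits at once.
print(round(bits_of_information(0.02), 2))  # ~5.64

# The degenerate 4-vs-0 split: everyone matches, so exactly 0 bits gained.
print(bits_of_information(1.0) == 0)  # True
```

This is why one rare attribute can be worth more than a handful of common ones, and why a "bit" of evidence that fails to split the candidate set is, by definition, worth nothing.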
I think the premise is false. You would need about 33 _unique_ bits. I doubt that you can prove the existence of a person-independent algorithm to gather these.
See <strike>comment #12</strike> the comment posted at February 12, 2010 at 5:15 am in the blog post.* The term entropy refers to uniqueness.
As for the development of algorithms to gather those bits, that's what my entire Ph.D. is about and what my blog is mostly about. This is what I've been proving for the last 6 years.
*Just realized comment numbers are unstable. Bad wordpress.
If you want to be very precise, you need evidence which causes at least a 33 bit reduction in entropy between your prior and posterior estimates of the probability of each person in the world being the one you're looking for.
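A toy illustration of that prior-to-posterior entropy reduction, using a made-up population of 8 suspects and an arbitrary likelihood ratio:

```python
import math

def entropy_bits(probs):
    """Shannon entropy (in bits) of a probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Prior: uniform over 8 suspects, so log2(8) = 3 bits of uncertainty.
prior = [1 / 8] * 8

# Evidence three times as likely for suspects 0 and 1 as for the rest
# shifts probability mass toward them without settling the question.
posterior = [0.25, 0.25] + [0.5 / 6] * 6

# The evidence is "worth" the entropy it removed; identification is done
# once the cumulative reduction reaches log2(population), i.e. ~32.6 bits
# for 6.6 billion people.
reduction = entropy_bits(prior) - entropy_bits(posterior)
```

Note the reduction here is a fraction of a bit: soft evidence that merely reweights candidates still counts, which is exactly the fractional-bits point made elsewhere in the thread.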
I also agree the premise is false, but it doesn't need to be 33 unique bits either. The combination of the bits has to be unique, and that is not easily provable unless you know the entire dataset, so the premise is just kind of irrelevant to reality, I think. They do this same kind of thing on crime shows all the time. He has a blue truck and a mustache. How many people with that description live in lower Queens? "13, sir."
Just out of curiosity, how many bits would it take to include all the people that have ever lived? Also, how many to realistically cover the future?
37 for everyone who's ever lived (+30% more people)... and for the future? I dunno, how long do you want to go? 2x...
However, 33 bits is a simplification. You can express 8.5 billion different values with 33 bits, but unless those 33 bits map to well-distributed discriminators, the number is meaningless...
How can anyone possibly ever answer that question?
People have been disputing how many people have ever lived on the earth for decades. Some anthropologists have thrown out numbers as high as 70-120 billion, although other scientists have said the number is probably around 7-10B.
But however many people X have ever lived, it doesn't really matter: that's just log2(X) bits needed.
As the author of the blog says, 6000 billion people could be written in just 43 bits.
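The log2 arithmetic in this subthread checks out; here it is as a quick sketch (the 106 billion ever-lived figure is the commonly cited estimate, not a settled fact):

```python
import math

def bits_to_distinguish(population):
    """Minimum whole bits needed to give everyone a distinct ID."""
    return math.ceil(math.log2(population))

print(bits_to_distinguish(6_600_000_000))      # 33 (world population at the time)
print(bits_to_distinguish(106_000_000_000))    # 37 (everyone who ever lived, est.)
print(bits_to_distinguish(6_000_000_000_000))  # 43 (the 6000 billion figure above)
```

Since each extra bit doubles the addressable population, even the wildest ever-lived estimates only move the answer by a couple of bits.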
The Anthropic Doomsday Argument says that, since it would take 37 bits to include all the people that have lived so far, it takes about 38 bits to include all the people that will ever live. Many people find this somewhat disconcerting.
> There are only 6.6 billion people in the world, so you only need 33 bits (more precisely, 32.6 bits) of information about a person to determine who they are.
I think you should count the dead as well. But then, 33 bits ~= 8 billion, which should still be enough, I guess.
randomwalker | 14 years ago
Happy to answer any questions!
zeratul | 14 years ago
https://www.i2b2.org/NLP/DataSets/Main.php
pjscott | 14 years ago
http://en.wikipedia.org/wiki/Entropy_(information_theory)
As the stuff posted on 33bits regularly demonstrates, it is surprisingly easy to get this much information for a whole lot of people.
funkah | 14 years ago
I suggest you peruse it.
andreasklinger | 14 years ago
According to… http://www.wolframalpha.com/input/?i=how+many+people+have+li... http://www.wolframalpha.com/input/?i=106+billion+in+binary
37 bits
jmatt | 14 years ago
Birthday, Gender and Zipcode is enough to identify someone uniquely approximately 85% of the time.
A quickly googled source (though the meme is older than that): http://godplaysdice.blogspot.com/2009/12/uniquely-identifyin...
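A rough back-of-envelope version of why that triple works. The figures below (365 x 100 birth dates, ~42,000 US ZIP codes, a 300M population) are assumed round numbers, and uniformity is assumed; real distributions are skewed, which is why the practical figure is ~85% rather than 100%.

```python
import math

# Assumed uniform components (rough figures, not real distributions):
dob_bits = math.log2(365 * 100)  # birth date incl. year, ~100 cohorts: ~15.2 bits
sex_bits = 1.0                   # a binary gender field: 1 bit
zip_bits = math.log2(42_000)     # roughly 42,000 US ZIP codes: ~15.4 bits

triple_bits = dob_bits + sex_bits + zip_bits  # ~31.5 bits in total
us_pop_bits = math.log2(300_000_000)          # ~28.2 bits singles out one of 300M

# The triple carries a few bits more than needed, which is why it is
# *usually* unique; skew in the real distributions eats the surplus
# often enough to leave the ~85% figure cited above.
surplus = triple_bits - us_pop_bits
```

The takeaway matches the blog's thesis: three innocuous-looking fields already add up to more than the ~33 bits needed for a whole-world identification budget's US-sized share.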