The Difficulty Of Private Contact Discovery

[+] MagicWishMonkey|12 years ago|reply

This is great until you realize the average address book is a disgusting mess of spelling errors, wrong values in wrong fields, punctuation in places where punctuation isn't necessary (commas in phone number fields, digits in name fields, etc.)

The only field that might be fairly consistent is email address and even that is no guarantee. Sure, it's possible to clean up a contact before hashing, but when you consider the fact that the typical phone contains >500 contacts it's not something that can be done in a reasonable amount of time on a modern smartphone. For the last two years I've been working on a product that involves cleaning/organizing contacts and making them searchable, when I started I had no idea what a massive undertaking it would be.

So, hash till your hearts content, but if you have 10 different people with the same contact in their address book I would be willing to bet that you will get 10 different hashes because humans are not good at data entry.

[+] danielrhodes|12 years ago|reply

This times a million.

Phone numbers in particular can be a horrid mess to deal with. Some decent libraries have come out to normalize them, but in general it remains difficult.

For example:

013811234 +88-1-3811234 3811234 0118813811234 008813811234

are all be the same number.

To hash them, you could naively just strip off the area codes and country codes, but that is only if you can know ahead of time they are indeed area or country codes. Some countries have 7 digit numbers, others have 9, some have 6, or even 5 digits. Sometimes you don't even know the country to apply these rules to.

[+] jokull|12 years ago|reply

Where can I find your app?

[+] ww520|12 years ago|reply

There similarity hash algorithms that takes a fuzzy approach to hashing slightly different values to the same hash value.

[+] sigil|12 years ago|reply

Interesting problem!

> It’s also possible to compress “updates” to the bloom filter. The server just needs to calculate the XOR of the version the client has and the updated version, then run that through LZMA (the input will be mostly zeros), and transmit the compressed diff to the client.

> Unfortunately, for a reasonably large user base, this strategy doesn’t work because the bloom filters themselves are too large to transmit to mobile clients. For 10 million TextSecure users, a bloom filter like this would be ~40MB, requested from the server 116 times a second if every client refreshes it once a day.

Wait, the LZMA-compressed daily diffs would be still be ~40MB? If the 40MB is just the initial bloom filter download, that's not so bad. If 40MB is the daily diff, I'd be interested in seeing the calculation for that.

[+] moxie|12 years ago|reply

Our feeling is that 40MB for an initial download is not acceptable on mobile devices, particularly in many of the areas we'd like to support. Maybe we're wrong, but it seems like too much on edge networks, particularly given that it's only going to grow as more users join. Ideally we'd have 100MM or 500MM, not just 10MM.

We can definitely make lightweight diffs happen, but that initial download is tough, and is going to get tougher.

[+] mtrimpe|12 years ago|reply

Indeed; if I understand it correctly encrypted data goes into a standard bloom filter.

If that's indeed the case then, assuming that users can never be removed, a daily update would just contain the indexes of the additional bits set by the newly registered users, which could never be 40MB.

It seems they also forgot to include lowering the p-value of the bloom filter as a trade-off, since it seems that a false positive means that a contact is sent to the server which shouldn't have been.

If you push that down to p=.95/.98/.99 or so you'd get a much smaller Bloom filter and many people might not object to a few % of 'collateral uploads' so to speak.

Minimizing the amount of information leaked through these various tradeoffs is an interesting problem though, since you'll have to leak something one way or the other.

[+] pbsd|12 years ago|reply

A new private-set intersection paper, claiming to scale orders of magnitude better than predecessors, has just been released [1]. I don't know whether it solves your problem, as I haven't read anything beyond the abstract, but you might want to check it out anyway.

[1] https://research.microsoft.com/en-us/um/people/senyk/pubs/sa...

[+] tlb|12 years ago|reply

The problem is that it's easy to upload 1000 randomly generated phone numbers and get back results. But I think you can distinguish random or sequentially generated numbers from a real user's actual contact list, by looking at the distribution of friends among the contacts. A real contact list should have a high density of friendships between contacts. The system should only return results if the uploader provides a subgraph that's more highly connected than a random one would be.

You'd need to look at real data and tune the parameters to make this effective.

[+] spullara|12 years ago|reply

I think there is a big difference between private contact discovery and contact discovery that isn't brute forceable like we are seeing with SnapChat and others. One change that could be made for these very small pre image corpuses would be to ask the target user if they should be exposed to the searcher.

1) I upload the hashes of my contacts which enqueues a request to those that match the hashes

2) The target of the request dequeues those requests and is given the option to expose themselves to the other user

3) If approved, the original uploader gets access to that user on the service

This would stop the massive discovery issue and would only put a medium amount of friction into the growth engine. It obviously doesn't cover the issue that the service itself has access to the contacts which means they could potentially be exposed. However, if the service discards hashes without matching queues and removes matches when dequeued, the risk to the user is much reduced from the full address book storage case.

This was written off the top of my head, so forgive me if I have left a big hole somewhere.

[+] herge|12 years ago|reply

RTFA. He addresses that in the first few paragraphs. Because there are 10^10 possible phone numbers (approximately), it is very easy to build a rainbow table of all hashed phone numbers and back-track a hash to it's phone number.

[+] jcampbell1|12 years ago|reply

I understand how the Encrypted Bloom Filter they describe keeps my contacts private from the service, but it seems to leak the entire contact list the same as the trivial solution:

I install millions of copies of the app, all with a different contact list. I do this signed query process for every possible phone number. I now have the complete list of users.

[+] moxie|12 years ago|reply

This post is about building untrusted servers, which is somewhat orthogonal to "the snapchat problem."

The best a privacy preserving protocol can hope to achieve is to match the characteristics of its non-privacy preserving equivalent.

Asymmetric PIR (the server transmits an unencrypted bloom filter) falls short of the non-privacy preserving equivalent, since the user makes offline queries that can't be rate limited by the server. Symmetric PIR (the server transmits an encrypted bloom filter) provides identical properties to the non-privacy preserving equivalent, since the user needs to make queries that can be rate limited by the server, even though the queries leak no information to the server.

How that rate limiting (or unlisted numbers, or whatever) works is outside the scope of the protocol, just as it is in the non-privacy preserving case. For TextSecure, you'd need to verify 1 million phone numbers in order to perform your attack. Whether you can do that or not isn't really the point: we simply want to do as least as well as everyone else, without revealing information to the server.

[+] robrenaud|12 years ago|reply

Is this a real problem?

Is it reasonable to trust for me to trust you to you run some arbitrary binary on my phone that does some magical black box computation on my contact list and then phone home to your service, but not trust to that you will just not be a dick with the raw data?

Stated more simply, if I am paranoid about my privacy, then why am I running your apps and letting you read my contact list at all?

[+] stormbrew|12 years ago|reply

Agreed. The whole thing seems to really miss the problem Snapchat exposed, turning an access control problem into a cryptographic problem. At least I assume this is being posted now because of snapchat.

The problem with how snapchat failed isn't so much that they weren't storing the information securely but that they had absolutely no gate on getting the information aside from a global api key. If you're running their app you've already exposed the phone numbers and are trusting them not to be too evil with them.

My expectation as a user of their service (I'm not, but if I were, and this applies to things like facebook as well, of which I am a user) is that they won't tell people I don't want to know who I am and if I'm using their service.

To this end what's needed is that before it gives another user a positive result (especially tied to my name) to my phone number it should have a reasonable expectation that I want them to have it. If they store the information that lets them derive this expectation in a way that keeps my info secure, that is also fantastic, but it's a second tier concern to becoming literally the whitepages by accident.

[+] sln|12 years ago|reply

Yes this is a real problem. Being "paranoid about privacy" is not a yes/no decision for people to make. Different people have different things they would like to be private about and to different degrees. Some people may not want it to be possible for anyone to ever know anything at all about them. Those people don't use phones of any nature. Maybe some other people only trust phones that they build themselves. Other people are willing to trust open source software and the communities around it. Some people only care about making dragnet surveillance difficult or expensive. Others need perfect assurance of privacy because their lives are in danger. Others don't care if anyone listens to their calls.

But what you are saying is like looking at football players, seeing that they could still get hurt when they wear lots of padding, and deciding either no one should play the game or we should all play inside giant airbags.

[+] abdullahkhalids|12 years ago|reply

They provide the source. Examine it and build it yourself.

[+] goatslacker|12 years ago|reply

To the end user probably not but if the encryption is performed on the client and the client code is open sourced and you don't auto-update your application then you'll know exactly what is happening to your data when you let it access your contacts.

[+] maxerickson|12 years ago|reply

In a similar discussion, I said something about hashing phone number pairs. The graph has a lot more edges than nodes.

Still doesn't stop someone from attacking a single number though.

[+] thatthatis|12 years ago|reply

I you treat a pair as a one-direction edge, and only match when you have an edge in both directions, then you could add text message verification to prevent people from both impersonating others and extracting additional information out of the service. It would be reasonably trivial then to notify clients when a new match comes in.

This approach seems to require that you trust the server.

[+] ash|12 years ago|reply

Interesting idea. But the address book "friends" must be mutual for this to work. You would miss those contacts that don't have you in the their contacts.

[+] pyrocat|12 years ago|reply

So don't do it. I know I'm probably in the minority here but I've never given another app permission to use my facebook or twitter account to "Find Friends" and I personally find the whole practice offensive. Just have an option to invite people by their email address/number/whatever, but not in bulk.

[+] ash|12 years ago|reply

You seem to dislike inviting a bulk of friends to a service. But the article doesn't talk about invites. The problem is securely discovering who's already invited. It's not offensive in my opinion.

[+] abentspoon|12 years ago|reply

What about this?

1) Client uploads a bloom filter with all contacts on phone

2) Server responds with a bloom filter with all registered contacts that match the client's bloom filter

3) Client displays contacts that match server's bloom filter

You can optionally trade contacts back and forth again with a larger bits/contact ratio to decrease false positives.

I think it works out so that in exchange for 7 bits of information about each contact from the client, you can reduce the server's response by a factor of 128.

[+] sigil|12 years ago|reply

The problem with this is that (1) + (2) means the server now knows the user's address book. They're trying to avoid that with encrypted bloom filters and blind signature queries.

[+] mtrimpe|12 years ago|reply

The biggest problem with that is:

1) Client uploads a bloom filter consisting entirely of ones.

2) Server responds by sending its entire contact database.

[+] ash|12 years ago|reply

How would you do (2) without iterating over all registered contacts on the server?

[+] Strilanc|12 years ago|reply

40 MiB for ten million users? I think you should ditch the bloom filter if the false-positive rate is set so low it takes that much space.

Assuming you can map phone numbers onto the integers from 0 to ten billion, trivial compression of the is-a-user/is-not-a-user bit sequence should easily fit in ~14MiB [1].

See also: succinct data structures [2] for accessing data without decompressing it.

(This is obviously a short term solution. Growth will overcome the savings.)

1: http://www.wolframalpha.com/input/?i=%28-p+lg_2%28p%29-%281-...

2: http://en.wikipedia.org/wiki/Succinct_data_structure

[+] fleitz|12 years ago|reply

It's not a technical problem it's a value prop problem. Most users don't care about privacy....

Fundamentally there is more value for the network operator to grow the network than for users to disclose their details. Email works perfectly fine without having to disclose your address book because no one owns the network and therefore do not care how many people are using email.

Alternatively one could just allow users to choose whether to make their info public and then the client can just search a directory.

[+] Illniyar|12 years ago|reply

What about artificially increasing the size of the hash input. For instance: when finding out if a a contact exists the client sends 10 iterations of the contact (I.E. the contact number with another digit, each 0-9) hashed, the server keeps only one random hashed one to compare to (presumably the client hashes one of them randomly when sending it's own contact information).

That way you increase the size of the problem from 10^10 to 10^11 but only increase your bandwidth 10 fold.

It still keeps the need to balance the amount of bandwidth used, but the bandwidth is no longer determined by the amount of all the users in the system, only the increase in problem size and the amount of contacts a person has.

Is that a worthwhile solution?

[+] cristiantincu|12 years ago|reply

> Building a social network is not easy. Social networks have value proportional to their size, so participants aren’t motivated to join new social networks which aren’t already large. It’s a paradox where if people haven’t already joined, people aren’t motivated to join.

Value is relative. For stakeholders, value is indeed given by the network’s size. For users (participants), it’s very debatable whether size equals value. The author seems to be mistaking the two perspectives for a single one. No paradox here.

[+] abdullahkhalids|12 years ago|reply

Not really. I will only want to join a social network if a lot of my friends are already on it. The fewer of my friends on it, the lower the value I get out of using the service (cetris paribus of course).

[+] abdullahkhalids|12 years ago|reply

Is is possible to leverage some sort of peer-assisted discovery? Where you ask some sub-set of your trusted contacts already on the system, to call the server on your behalf. So that the server is less likely to know whose contacts are being sent to it.

Of course, this can only be part of the full solution.

Disclaimer: just throwing a wild idea out there. don't know anything about the field.

[+] jahewson|12 years ago|reply

Could you use an enormously expensive hash function so that building a lookup table is infeasible?

[+] russellsprouts|12 years ago|reply

Assuming you want to try every phone number in the US, that would be almost 10 billion phone numbers. An above average user may have about 1,000 contacts. We want this to run on phones in a reasonable time for 1000 contacts, but be infeasible for 10 billion using any hardware the attacker uses.

Bitcoin gives an idea of the relative speeds. I looked up the speeds of some Android bitcoin miners. All were on the order of 10Khash/s.

The attacker is not constrained to use phones. Custom hardware for BitCoin mining can be on the order of 1Thash/s.

This means that the custom hardware can reach 10^8 times faster, but the work to be done -- 10 billion hashes vs. 1000, is only 10^7 times as much. If the slow hashing algorithm used was anything like the one for BitCoins, a determined attacker could calculate the entire table faster than the person looking up their contacts.

This is not the whole story -- this could be mitigated using better hashing algorithms that are designed to not scale well with better hardware, but the user will only wait maybe a minute, while an attacker could spend weeks.

[+] unknown|12 years ago|reply

[deleted]

[+] nwzpaperman|12 years ago|reply

[deleted]

[+] amix|12 years ago|reply

I don't really think you need to solve this problem using crypto. I think a simpler solution would be to use the "hash it" solution, but restrict how many lookups each client can do. You can either do this by the using the client's phone number, IP or other unique information etc. This way an attacker would have a very hard time brute forcing this.

Additionally you could use captchas (or other humanity tests, such as SMS verification) to limit hackers creating automated accounts (in order to fight automated bots spamming and polluting the network).

[+] cortesoft|12 years ago|reply

I think the point it to avoid having to trust the server. If you can trust the server not to do malicious things with the data, then you can do any number of techniques.

This post is about how to do contact lookup without having to trust the server to maintain privacy.

71 comments