top | item 43770758

(no title)

qw | 10 months ago

> The client then searches through the list to see if the desired email is in the list or not.

The initial prefix check would probably reduce the amount of lookups necessary, as it would only be necessary to do a deeper search if the prefix matches.

discuss

Thorrez|10 months ago

>only be necessary to do a deeper search if the prefix matches

There are 5 billion emails in at least 1 breach and 16 million prefixes. Almost all if not all prefixes have at least 1 email in a breach. So almost all prefixes match. I don't see why it's useful to spend a bunch of effort optimizing the very rare case of a prefix not matching.

Now, if the bloom filter checked emails instead of checking prefixes, that would be useful. However, a bloom filter of 5B elements with a 10% false positive rate would be 2.8 GB, which is prohibitively large.

https://hur.st/bloomfilter/?n=5g&p=10&m=&k=

smallpipe|10 months ago

Yeah that was my point, you can get rid of a significant portion of requests at the edge with a bloom filter, and there's no reason you have to build the bloom filter locally as requests come in. Instead, it can be created ahead of time, when the dataset is updated.

Thorrez|10 months ago

See my reply at https://news.ycombinator.com/item?id=43780713 .

Also regarding "you can get rid of a significant portion of requests at the edge with a bloom filter", Troy's existing design already gets rid of a significant portion of requests at the edge. That's why he says

>The response from each search was coming back so quickly that the user wasn’t sure if it was legitimately checking subsequent addresses they entered or if there was a glitch.