(no title)
FRex | 7 years ago
For example: let's assume all passwords (including all of the ones on haveibeenpwned list) are exactly 8 chars long and composed of 50 characters.
This gives a lot of headstart to this filter since the way possible passwords outnumber ones on the list is extremely more crushing in reality due to many possible lengths, unbounded length and there being 95 printable ASCII chars to use (even alnums themselves is 62 chars already).
passlen = 8
passchars = 50
badpasses = 501636842
allpasses = passchars ** passlen
falsepos = 0.001
print(badpasses / (badpasses + falsepos * allpasses))
The above script prints 0.012679079642336052, which means (in these extremely forgiving scenario from above where possible passwords are limited to next to nothing compared to real life) only just over one in a hundred positives is a true positive.This apparently comes up with FP rates for cancer and AV and such (most people dont' have cancer, most exes aren't viruses, etc.).
I might be wrong, feel free to point that out. I'm not a statistician or a doctor, I just had a statistics class as part of my CS degree. :)
Edit: I've found the English terms and explanations:
bo1024|7 years ago
But actually, probably more like 1 in 10 or 1 in 50 users use bad passwords. If we use 1 in 50, we get that 19/20 positives are true positives and only 1/20 are false positives. With 100,000 users we'd expect to trip the filter 2,000 times, only 100 of which are false positives.
vertere|7 years ago
kolpa|7 years ago
The number of false positives don't matter, and getting a false positive matters nearly not at all. 1 in 1000 good passwords are rejected. Good passwords are nearly random, and cheap to generate. That means that average person needs to re-pick a password once every, what, 100 years?
thomasahle|7 years ago
"But it's much more likely than running into a true positive" you might say.
Actually not, since the user runs into a true positive exactly when they choose a very common password. It's not random. If they only choose bad passwords they'll run into true positives evey single time.
On the other hand, that there exist more false positives than true ones is actually great, since this means you can't recreate the true set of passwords given the Bloom filter.
stochastic_monk|7 years ago
That being said, false positives will hopefully outnumber true positives, since not many people should use one of them.
On the use of a bloom, I would instead use a cuckoo filter. Then, since its error rate plateaus when it's close to full, you could get to the same error rate with a significantly smaller filter.
CapacitorSet|7 years ago