Probabilistic Data Structure Showdown: Cuckoo Filters vs. Bloom Filters

[+] kibwen|9 years ago|reply

  > Since it’s possible to hash multiple items to the same 
  > indices, a membership test returns either false or maybe.

As someone who perpetually has trouble keeping the terminology of {false|no false} {positives|negatives} straight, I just wanted to remark that this is a very intuitive and straightforward way of explaining the guarantees of lookup in a Bloom filter.

[+] dexwiz|9 years ago|reply

True/False says if the test was correct. Positive/Negative is if the test passed.

True Positive: Test was correctly passed, hit. False Positive: Test was incorrectly passed, false alarm. True Negative: Test was correctly failed, reject. False Negative: Test was incorrectly failed, miss.

I find the terms to be too technical, so I remember a common situation. Medical tests are often False Positives. Which means the test was positive, but the test was wrong and you are okay (good!). From there is pretty easy to infer the rest.

Infact, that situation leads to a large issue with medical testing at a whole. If you were tested for 100 terrible diseases that you did not have, you would probably get a bunch of False Positives. Medical tests are set up for a high False Positive rate but low False Negative rate, because its less damaging to confirm a diagnosis with follow up testing than to miss a test. But confirming False Positive tests is expensive, along with the initial testing. So doctors try to limit tests to at-risk groups to spend less time on healthy patients' False Positive tests.

[+] jakewins|9 years ago|reply

As an aside, this is how null is defined in many languages as well. Many behaviors of SQL, for instance, are much more intuitive if you remember that SQL sees null as "unknown".

If a boolean value is null in SQL, it conceptually means it could be either true or false - it's a maybe. If you ask "is maybe true equal to maybe true", the answer is "maybe". Likewise, asking SQL 'null = null", it answers "null", instead of true/false.

[+] gopalv|9 years ago|reply

It's a much better notation because NOT(MAYBE) = MAYBE.

That's hard to explain when you work with binary booleans.

[+] mgunlogson|9 years ago|reply

I try to think of these filters as the opposite of a cache. Bloom/Cuckoo has no false negatives and a cache has no false positives. Besides this they are very similar conceptually and used in the similar contexts.

[+] saidajigumi|9 years ago|reply

For those who are new to Bloom filters, the article doesn't quite finish connecting the dots to a common use case:

Bloom filters give no false negatives but a controllable rate of false positives. If a Bloom filter indicates that an item has not been seen before, we can be certain that’s the case; but if it indicates an item has been seen, it’s possible that’s not the case (a false positive).

This style of probabilistic data structure can be used to achieve a significant speedup when the full test for set membership is very expensive. For example, run the test using the Bloom filter. A negative result can be taken as truth, while only positive results will have to be tested against the expensive algorithm to weed out false positives. Obviously, the expectation should be that negative results are far more common than positive results.

[+] lilyball|9 years ago|reply

As a practical example of this, consider searching a corpus of texts for a phrase. You can build a bloom filter from each text ahead of time, using each individual word as the elements to put in the filter. Then when searching, you can split the search phrase into words and check if all the words occur in the bloom filters. If any word does not occur, then you know the phrase won't match that text. But if the bloom filter can't rule out any of the words, then you can do the more expensive full-text search. This means you can search the whole corpus much quicker, as you can skip any text that doesn't match the words.

[+] manish_gill|9 years ago|reply

The Cuckoo Hashing algorithm is quite fun. I've been reading quite a bit about probabilistic data structures like Bloom Filters, Cuckoo Filters, HLLs etc lately. Here's a paper where they're empirically compared to bloom filters of various types: https://www.cs.cmu.edu/~dga/papers/cuckoo-conext2014.pdf

> A single HyperLogLog++ structure, for example, can count up to 7.9 billion unique items using 2.56KB of memory with only a 1.65% error rate.

Incidentally, Redis has an implementation for HyperLogLogs that has an even smaller error rate of 0.83%, though it uses more memory if I recall correctly. :)

[+] pselbert|9 years ago|reply

The HyperLogLog implementation in Redis is extremely well documented and amazingly quick. It drops down to non-probabilistic counting for smaller sample sizes, providing 100% accuracy with comparable speeds. As a fun exercise, which I wrote about [0], I estimated the number of unique words throughout the major works of Charles Dickens. Side note, that man had an amazing hairdo [1].

[0] https://blog.codeship.com/counting-distinct-values-with-hype... [1] https://upload.wikimedia.org/wikipedia/commons/a/aa/Dickens_...

[+] mgunlogson|9 years ago|reply

Just so nobody is confused, HyperLogLog etc. are used for a different operation than Bloom/Cuckoo filters.

HLL is used to determine how many unique items you have, the cardinality of a set.

Bloom/Cuckoo is used to determine set membership, as in "have I seen this item before".

Not sure how much that helps but the distinction was lost on me when I started reading about them.

[+] greglindahl|9 years ago|reply

HyperLogLog can trade space for accuracy. One annoying thing about some implementations is that they don't give the programmer access to that trade-off.

As an example, a search engine might want to have several HyperLogLog counts for every URL on the Internet. In that case, a 64-bit count may be a good choice over 12 kilobytes.

[+] mgunlogson|9 years ago|reply

Hi there, I've wrestled with Cuckoo filters myself, so I took a look at your lib and there's a some things that stood out.

Fingerprint size- It allows fingerprints that are too short, basically less than 4 bits doesn't allow a reasonable fill capacity. The paper authors only hinted at this, but check out the fill capacity graphs on page 6. This could be why your inserts are slowing down around 80% level when in my experience it doesn't happen till around 93%.

Modulo bias - Your filter capacity code doesn't seem to round the filter size to a power of two. This is a simple fix, but without it your array indexes will be skewed by modulo bias, possibly badly if someone picks a nasty number.

Alternate bucket positions - Your code seems to do a full hash for calculating alternate bucket positions. I know the paper mentions this, but I haven't seen anyone actually doing it :). Its a lot faster to just XOR with a mixing constant. TBH that's what most libraries are doing... whether it's a good idea is debatable :).

Fingerprint zero corner case - I'm not that great at python, but I didn't see any special handling for the rare case that the hash fingerprint is zero. In most implementations zero means empty bucket, so this could violate the "no false negatives" aspect of the filter by making items rarely disappear when they were supposed to be inserted. Most implementations either just add one to it, but I prefer to re-hash with a salt.

No victim cache - Didn't look too much into it, but I didn't see a victim slot used in your code. This will cause problems when the first insert fails. The problem is, by the time the first insert actually fails, you've relocated a bunch of different fingerprints like 500 times. It becomes unclear which fingerprint you originally tried to insert, and you're left holding a dangling item from a random place in the filter that you cannot insert. This violates the "no false negatives" mantra. Even though the filter is full it shouldn't break by deleting a random item when the first insert fails. You either need to store this last item or figure out a way to unwind your insert attempts to be able to reject the item that originally failed to insert.

Check out my Java library if you want to see how I dealt with these things. Also I have a bunch of unit tests there that I either came up with or borrowed from other Cuckoo libs. Should be pretty easy to convert some of those to python :) .

https://github.com/MGunlogson/CuckooFilter4J

Cheers, Mark

[+] adebayoj|9 years ago|reply

Hi Mark,

One of the authors here. Great points especially on the "No victim cache" aka insertion failure. This was mostly a test implementation for us to see how cuckoo filters work, but agree that it is critical to have these issues fixed.

I'll check out your library for improvements.

thanks!

[+] ScottBurson|9 years ago|reply

> With the Cuckoo filter, we notice an insertion throughput increase of up to 85 percent as it fills up to 80 percent capacity

That's not a throughput increase, that's a throughput decrease. Throughput is operations per unit time, not time per operation.

[+] mgunlogson|9 years ago|reply

Hmmm.... this might be an artifact of the python implementation?

The insert speed should be roughly constant until the filter begins moving items into their alternate buckets(when one of the two buckets for that item is full). In my implementation this doesn't usually happen until well over 90% capacity.

Insert speed will slow slightly as the filters fills up if the implementation uses a shortcut to determine the first empty bucket position by just checking if it's zero. Conceptually, if a bucket is empty you only need to check the first position before the insert. When the filter is half full you need to check ~%50 of bucket positions before an insert. In practice the bucket operations are so fast and buckets only hold 4 items each, I haven't noticed a difference.

[+] taeric|9 years ago|reply

I maintain that if you want to think of how a bloom filter works, think of it as the front desk staff at an office building. You can ask a series of questions such as "Has anyone wearing a hat come in?" "Any exceptionally tall people?" etc... It won't help answer if Peter from accounting is there, but can help give a good indication if you know enough about him today.

Obviously, if this is not a correct way to think of it, I'm open for more correct ways.

[+] Retric|9 years ago|reply

Not really, it's more like a guest list with only the first two letters of guests last names.

If someone shows you might learn they are not a guest, but having the 'correct' "La" letters does not mean they are a guest.

The advantage is a doorman can have a tiny list to direct people inside, the downside is it's poor validation and does not scale very far.

PS: You can obviously add more letters, but the same problem happens if two people have the same names.

[+] danijar|9 years ago|reply

That's a beautifully simple analogy.

[+] drblast|9 years ago|reply

Imagine a security guard who can't list from memory everyone who is in the building, but if you show him a picture of Peter from accounting, he can with certainty tell you that he hasn't seen Peter yet.

Not sure a bloom filter needs an analogy but there you go.

[+] blauditore|9 years ago|reply

There are two things I don't fully understand:

1. If an item in the Cuckoo filter needs to be moved, how does one know its other hash/location if only a fingerprint of the original item is stored (i.e. it can't be hashed again)?

2. [From the linked pdf] "Insertions are amortized O(1) with reasonably high probability." In case of a rehash, every item needs to be hashed and inserted again. This seems very expensive and impractical for data on Twitter scale, even if it only happens seldomly. Or are there any workarounds to mitigate this?

[+] thenewwazoo|9 years ago|reply

To 1, they answer this in the paper; one knows it because

The xor operation in Eq. (1) ensures an important property: h1(x) can also be calculated from h2(x) and the fingerprint using the same formula. In other words, to displace a key originally in bucket i (no matter if i is h1(x) or h2(x)), we directly calculate its alternate bucket j from the current bucket index i and the fingerprint stored in this bucket by

    j = i ⊕ hash(fingerprint).

[+] mgunlogson|9 years ago|reply

The thing with Cuckoo filters is that the item's alternate position is determined solely from it's current position and fingerprint.

Mathematically this is done by XOR'ing with a magic mixing constant borrowed from Murmur2 :) (or in my lib from Mumur3). The quality of the randomness of alternate bucket positions is...probably bad, but in practice it seems to work fine.

Because of this finding alternate positions doesn't require a rehash per se, just a few CPU instructions.

However, there IS a corner case that requires re-hash that comes up more frequently the shorter the fingerprints used. Conceptually a tag/fingerprint of zero means empty bucket, and occasionally the fingerprint hash for an item will end up being zero. In this case I've seen some libraries just add 1, which I didn't really like since it increases hash collisions slightly. Or you can do it like my lib does and just re-hash the item with a salt.

[+] grogers|9 years ago|reply

Does anybody have an example of an application where you need the counting/deletion properties of a counting bloom filter or cuckoo filter over a normal bloom filter? If you do, how do you deal with the filter rejecting your writes because a bucket is full?

It seems like for most applications, silently degrading (instead of rejecting the insertion) when the bloom filter is above capacity is a super useful property.

[+] throwbsidbdk|9 years ago|reply

The failure mode doesn't matter much in practice. You need to track how many inserts you've done on both for practical reasons so with either you'll have a set cutoff before rebuild. Cuckoo filters are more likely to be used in place of counting bloom filters than vanilla ones.

You could always avoid duplicate inserts in cuckoo by checking contains before calling insert again. A modified insert-only-once routine would only have a small performance penalty. You can't use counting or deletion while doing this though, so its a trade-off. This same trade-off happens with counting bloom filters but they are much less space efficient.

Practically the use case for cuckoo filters over bloom probably lies in bi directional communication. Partner nodes can keep each other's state updated without needing to exchange a ton of data. Think distributed caches. So two data nodes exchange cuckoo filters of each other's data initially. As things fall out of their caches they can tell each other to delete those items from each other's filters by sending only the bucket index and fingerprint. Probably much smaller than the piece of data that represented originally. Since each data node independently knows what was in their cache there's no risk of false deletions. You can't really use bloom filters for this because you can't delete

[+] pklausler|9 years ago|reply

If a writer doesn't know the difference between 'unique' and 'distinct', suspect their mathematics.

[+] justicezyx|9 years ago|reply

http://math.stackexchange.com/questions/24119/difference-bet...

It seems the article's use of unique does not matter much in the context. The level of math explanation seems an over-kill.

[+] neeleshs|9 years ago|reply

What's the difference?

[+] stonewhite|9 years ago|reply

The election memes apparently really got to me, I was confused to see the headline 'Cuck filter' for a moment in HN frontpage.

44 comments