bdupras's comments

bdupras | 8 years ago | on: Probabilistic Filters By Example

Yes, cuckoo filters and counting bloom filters can start rejecting insertions before the filter is fully loaded - cuckoos because of the inability to find an open slot, and counting blooms because a counter saturates. Note that this is only the case when deletion or counting support is required.

In some cases, like when there's a high likelihood of duplicate entries, this can represent a catastrophic failure mode. That's a good point, and when I get around to updating the tutorial, I'll make note of it.

Regarding the failure mode, an insertion can be knowingly rejected without mutating the filter, so the filter remains useful for subsequent insertions and all reads, retaining all its guarantees. The only loss is the rejection of the item - in our case we used this as signal to rebuild the filter at a larger size.
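A minimal sketch of that check-before-mutate pattern (my own illustration, not our production code): the insert first verifies that no counter would saturate, and only then increments, so a rejected insertion leaves the filter untouched.

```python
import hashlib

class CountingBloom:
    def __init__(self, m=64, k=3, counter_bits=4):
        self.m, self.k = m, k
        self.max_count = (1 << counter_bits) - 1  # 15 for 4-bit counters
        self.counters = [0] * m

    def _indexes(self, item):
        digest = hashlib.sha256(item.encode()).digest()
        # k positions derived from one digest (deduplicated)
        return {int.from_bytes(digest[4 * i:4 * i + 4], "big") % self.m
                for i in range(self.k)}

    def insert(self, item):
        idx = self._indexes(item)
        # Check before touching anything: if any counter would saturate,
        # reject without mutating, so the filter keeps all its guarantees.
        if any(self.counters[i] >= self.max_count for i in idx):
            return False  # caller can treat this as a signal to rebuild larger
        for i in idx:
            self.counters[i] += 1
        return True
```

A caller that sees `False` can keep querying and inserting other items safely; only the rejected item is lost, which is exactly the signal we used to trigger a resize.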

bdupras | 8 years ago | on: Probabilistic Filters By Example

Yeah, you're right that the demo is broken - sorry for the confusion; I stopped working on this once it was good enough for an internal talk at my company.

Re multisets, yes, cuckoo filters and counting blooms are bounded - cuckoos by entries per bucket x 2, and blooms by the bits per counter. Cuckoo's counting capability is more a side effect of the design. It's really only interesting in that you get limited counting in roughly the same bit space as a non-counting bloom. I suppose you could add counting bits to a cuckoo to support higher bounds in a similar manner to counting blooms.

I'm curious to hear more on your point on K-hashing in bloom filters - would you expand on your statement about only needing a K-bit hashing function? I'm happy to update the text on the tutorial to be more clear/exact.

bdupras | 8 years ago | on: Probabilistic Filters By Example

Yeah - pardon the crappy javascript and UX. Notice the colors on the cuckoo filter when you insert this sequence. When you try to insert `3g46stb6vy6vsyosyvosfsdfsdsah`, the fingerprint entries in the cuckoo turn red indicating an unsuccessful entry. All possible fingerprint slots in the cuckoo for that entry were already occupied, and the recursive rearranging of entries was unable to free up a slot.

This is a downside to cuckoos (and counting blooms) - even when the filter has reasonable capacity remaining, an insertion can be denied due to collision with previous entries and saturation of their slots.
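For concreteness, here's a rough sketch of that insert path (names and parameters are mine, not the demo's): each entry has two candidate buckets derived from its fingerprint, and when both are full, resident fingerprints are kicked to their alternate buckets for a bounded number of attempts before the insert is knowingly rejected.

```python
import random, hashlib

BUCKETS = 16     # number of buckets (power of two)
SLOTS = 4        # fingerprint slots per bucket
MAX_KICKS = 500  # give up relocating after this many evictions

table = [[] for _ in range(BUCKETS)]

def _hash(data):
    return int.from_bytes(hashlib.sha256(data).digest()[:4], "big")

def fingerprint(item):
    return _hash(b"fp" + item.encode()) % 255 + 1  # 1..255, never 0

def alt_index(i, fp):
    # partial-key cuckoo hashing: alternate bucket from fingerprint alone
    return (i ^ _hash(bytes([fp]))) % BUCKETS

def insert(item):
    fp = fingerprint(item)
    i1 = _hash(item.encode()) % BUCKETS
    i2 = alt_index(i1, fp)
    for i in (i1, i2):
        if len(table[i]) < SLOTS:
            table[i].append(fp)
            return True
    # Both buckets full: evict a random resident fingerprint and retry.
    i = random.choice((i1, i2))
    for _ in range(MAX_KICKS):
        victim = table[i].pop(random.randrange(len(table[i])))
        table[i].append(fp)
        fp = victim
        i = alt_index(i, fp)
        if len(table[i]) < SLOTS:
            table[i].append(fp)
            return True
    # Rejected. Note this toy drops the last evicted fingerprint; a real
    # implementation parks it in a victim slot or triggers a rebuild.
    return False
```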

The UX on this demo could be better - e.g. not insert into the bloom unless the cuckoo insert is successful (to keep them from diverging). Also, a message indicating an unsuccessful insertion would be nice.

bdupras | 8 years ago | on: Probabilistic Filters By Example

A few types of filters support deletion (e.g. counting bloom filters and cuckoo filters). To my knowledge all of them require prior knowledge that the entry was successfully inserted. That is, if you delete an entry on a guess and the delete appears to succeed (because something colliding is present), you'll ruin the filter by creating a false-negative case.

As for collisions, only a limited number of colliding entries can be inserted into a cuckoo - after which a subsequent insertion will knowingly fail. So say Alice and Bob hash out to the same values, and you've inserted 2 Alices and 1 Bob. When you delete one Alice, it doesn't matter which entry is removed - two entries still remain that hash to Alice and Bob.

Perhaps the nuance you mention is that the alternate bucket index for an entry is calculated only from the fingerprint, not from the entry data. This has the effect that even when multiple entries share fingerprints, any one can be removed. (Again, _only_ if the system requesting the delete positively knows that the entry was previously successfully inserted.) The other colliding entry will still be found on query and will generate a "maybe" response.
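A quick way to see why either copy can be removed: with the partial-key trick, the alternate index is `i XOR hash(fp)`, which is its own inverse when the bucket count is a power of two - starting from either bucket, the same computation yields the other one. A small sketch (constants are illustrative):

```python
import hashlib

BUCKETS = 256  # power of two, so XOR-then-mask is self-inverse

def alt_index(i, fp):
    # The alternate bucket depends only on the current bucket index and
    # the fingerprint -- never on the original entry data.
    h = int.from_bytes(hashlib.sha256(bytes([fp])).digest()[:4], "big")
    return (i ^ h) % BUCKETS
```

Applying `alt_index` twice always returns the starting bucket, so a delete can match a fingerprint in either of its two buckets and remove it.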

bdupras | 9 years ago | on: Probabilistic Filters By Example: Cuckoo Filter and Bloom Filters

Yeah, absolutely. You get the same "feature" from Counting Blooms - i.e. if an insertion would overflow any of the counters, the filter can knowingly reject the insertion.

This is both a feature and a limitation. In our use case, our filter grows predictably, and every few months a rejected insertion triggers a resize/rebuild from source data. Because we could tolerate a full rebuild, we chose for now not to implement the technique of growing the filter by adding successively larger filter segments with lower fpp guarantees, though that is an option.
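For anyone curious, that segmented-growth option can be sketched roughly like this (a toy of my own, not our system): keep a list of fixed-capacity segments, query all of them, and append a larger segment when the newest one fills. (A real scalable Bloom filter also tightens the fpp of each new segment; that's elided here.)

```python
import hashlib

class Segment:
    """A tiny fixed-capacity Bloom-like segment; a set of bit positions
    stands in for the bit array."""
    def __init__(self, m, capacity, k=3):
        self.m, self.k, self.capacity = m, k, capacity
        self.count = 0
        self.bits = set()

    def _indexes(self, item):
        d = hashlib.sha256(item.encode()).digest()
        return [int.from_bytes(d[4 * i:4 * i + 4], "big") % self.m
                for i in range(self.k)]

    def insert(self, item):
        if self.count >= self.capacity:
            return False  # segment full: caller must add a bigger one
        self.bits.update(self._indexes(item))
        self.count += 1
        return True

    def contains(self, item):
        return all(i in self.bits for i in self._indexes(item))

class SegmentedBloom:
    def __init__(self):
        self.segments = [Segment(m=64, capacity=8)]

    def insert(self, item):
        if not self.segments[-1].insert(item):
            last = self.segments[-1]
            # Grow by appending a successively larger segment.
            self.segments.append(Segment(m=last.m * 2, capacity=last.capacity * 2))
            self.segments[-1].insert(item)

    def contains(self, item):
        # A hit in any segment is a "maybe".
        return any(s.contains(item) for s in self.segments)
```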

bdupras | 9 years ago | on: Probabilistic Filters By Example: Cuckoo Filter and Bloom Filters

Regarding the second point on deletions, the cool thing about Bin Fan's work on the cuckoo filter is that you get the added benefit of deletion in almost exactly the same memory footprint as a standard Bloom. This was the critical feature that caused us to implement the structure for our use case.

bdupras | 9 years ago | on: Probabilistic Filters By Example: Cuckoo Filter and Bloom Filters

To further clarify, if a Bloom filter query is O(k), where k is the number of hash functions, then for a fixed k = 7 the filter could be described as O(1).

The constant factors here are important for practical use. Hashing 7 times is significantly more expensive than hashing 1 or 2 times.
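As context on why fewer hash evaluations can suffice (touching on the K-hashing question above), here's a sketch of the standard double-hashing trick from Kirsch & Mitzenmacher: derive all k indexes from two halves of a single digest, so a query costs one real hash plus k cheap index computations.

```python
import hashlib

def bloom_indexes(item, k, m):
    # Two 64-bit halves of one SHA-256 digest stand in for two independent
    # hash functions; g_i(x) = h1(x) + i*h2(x) mod m yields the remaining
    # "hashes" with simple arithmetic instead of more hashing.
    digest = hashlib.sha256(item.encode()).digest()
    h1 = int.from_bytes(digest[:8], "big")
    h2 = int.from_bytes(digest[8:16], "big")
    return [(h1 + i * h2) % m for i in range(k)]
```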

For the original commenter, give me a suggestion if you don't mind. How would you expect to see that info depicted in a comparison chart so that it's honest and obvious?

bdupras | 9 years ago | on: Probabilistic Filters By Example: Cuckoo Filter and Bloom Filters

It isn't clear in the demo visualization, but insertions into the Cuckoo filter do get rejected after exhausting a maximum number of kick/relocation attempts. If I find some time I'll make that more apparent.

As for false negatives, this can happen in both Cuckoo and Counting Bloom filters if a value that was never added is removed, and that value would have resulted in a false positive on query. The contract for removal is that you must know the value is currently in the filter. I'll clarify that in the text.
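Here's a deliberately degenerate sketch showing exactly that hazard: a "filter" with a single counter, so every item collides by construction and the false positive (and the resulting false negative) is guaranteed rather than probabilistic.

```python
class OneCounterBloom:
    """A counting Bloom filter collapsed to m=1, k=1: every item maps to
    the same counter, making the collision deterministic."""
    def __init__(self):
        self.count = 0

    def insert(self, item):
        self.count += 1

    def contains(self, item):
        return self.count > 0

    def delete(self, item):
        self.count -= 1

f = OneCounterBloom()
f.insert("alice")
assert f.contains("bob")        # false positive: "bob" was never added
f.delete("bob")                 # deleting on a guess...
assert not f.contains("alice")  # ...creates a false negative for "alice"
```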
