The Opposite of a Bloom Filter

[+] MichaelGG|13 years ago|reply

Took me a few reads, but it's actually quite simple.

You want a structure that can tell you something has been seen, but sometimes forgets, but will never incorrectly tell you something has been seen.

Solution: an array. Hash the item to find its index, swap out what's there, and see if it is your item. If so, you know for sure it was previously placed. If not, then it might not have (it may have been forgotten).

  let arr = Array of key
  let contains_key key = 
    let index = hash key % arr.length
    let prev_key = arr.swap[index, key]
    return prev_key = key

The hash algorithm is crucial. Reducing forgetfulness is as simple as making the array longer. And he points out that if you can compress the keys, you can reduce storage size.

[+] xianshou|13 years ago|reply

@jderick: Specifically, it functions the same as a cache with a random removal policy, assuming a good hash function. So a cache and a Bloom filter are opposites: the first knows when something hasn't been seen, and might know when it has; the second knows when something hasn't been seen, and might know when it hasn't. The only difference between this and a normal cache is that we're dropping the values.

On that subject, you might as well have an LRU valueless cache if locality is going to play a role.

[+] jderick|13 years ago|reply

So it's just a cache?

[+] IvarTJ|13 years ago|reply

So combining the two will not give you something that will make up for each of the ones weaknesses and always give a definite answer, but will give one of three answers – definite positive, maybe, and definitive negative.

          | P | M | N
    Bloom | X | X |
    Oppos | X |   |

[+] gojomo|13 years ago|reply

Seems like a cache using open addressing but no probing, just instant eviction on any initial collision.

So, could also consider adding probing to lessen destructive collisions (up to some chosen probing depth) while there are other unused slots, in still-constant average time. Or, any other choice-of-slots approach such as 'cuckoo hashing':

http://en.wikipedia.org/wiki/Hash_table#Cuckoo_hashing

Whether you'd want to pay that cost could depend on how soon you reach saturation within each reset period, after which 'every' insert results in an eviction. But if you were usually in one-for-one eviction mode, you might add a bit or more that hints 'age' or 'recent access' to probed items. Then each eviction could be biased towards an older/less-accessed value... taking more advantage of the clustering-in-time that was reported.

And once you add that aging bit you might not need discrete full-restart periods ("hourly") at all: just make a useful proportion of older entries eligible for eviction each interval, maintaining a steady-state of full-array, evictions-biased-towards-older-entries.

[+] ajb|13 years ago|reply

This is useful. However, it is larger than a bloom filter, because a bloom filter needs one bit where this stores a value. I suspect that this asymmetry is unavoidable, but I can't prove it. Anyone?

[+] gms7777|13 years ago|reply

These sorts of filters have limits to accuracy and usefulness (and a definite tradeoff between memory requirements and accuracy). You can take both to the extreme, while still maintaining the definition:

A structure that may report false positives, but no false negatives: A single bit set to 1

A structure that may report false negatives, but no false positives: A single bit set to 0

Obviously that's entirely useless, but I think it shows that you can't make any definitive proof of size requirements

[+] eblume|13 years ago|reply

This is a very good point, and one I think the author completely missed.

I use bloom filters in an application at work and was very interested to hear about this, wondering if some sort of hybrid might be useful. But when I saw that the hash map was storing the actual objects, I immediately dismissed the idea for my use case - we can't afford to store anything more than a handful of bits. (Our bloom filters set between 9 and 11 bits per signature.)

The key advantage to bloom filters is that the represent set identity lookups (albeit in an adjustable lossy false-positive way). They do not, however, store actual sets of data. They must be coupled with a storage mechanism (traditionally with a several-order-of-magnitude slower lookup speed than the bloom filter) to actually store the items contained within the set - the bloom filter is just a way of definitively answering the question "Could this set possibly contain this item?"

[+] shasta|13 years ago|reply

You'd need to make your claim more precise, in order to prove it, but a proof might go something like this: assume that an anti-bloom filter starts empty, and that it over time sees some small fraction of the possible objects (because objects are 64+ bits, say). If the anti-bloom filter usually reports "yes" for a significant fraction (even just 1%) of the seen objects, then an information theoretic argument tells you it must have size around the number of objects times the object size.

[+] paul|13 years ago|reply

I'm skeptical that "hashing is much faster on an array the size of a power of two".

Also, if memory is actually an issue, he should store hashes instead of full objects (assuming the byte arrays are more than 16 bytes).

[+] ericsink|13 years ago|reply

IIRC, the real reason for this "power of 2" is for fast rehashing when the table grows. Calculate a 32 bit hash and store it all, but if your table currently has 1024 entries, only use 10 bits of it. Later, when your table grows to 2048 entries, you can just mask off 11 bits worth instead of recomputing every hash.

But, since the hash table is not going to grow in this particular case, any perf gain from using a power of 2 will probably be far less important.

[+] AlisdairO|13 years ago|reply

It's fairly common to keep array sizes bounded at a power of two, because when you take a hash you can use bitwise shifting instead of a modulus to determine the hash bucket to use. Depending on the performance of your hash function it can have a noticeable impact, although you're right that it's probably not a matter of 'much' faster.

[+] pdeuchler|13 years ago|reply

This may be naive of me, but wouldn't it be better to leverage your language's existing tools? For instance, in Python you would use a dict...

  >>> my_dict = {}
  >>> my_dict[item] = True #Insert an item
  >>> try:
  ...    tmp = my_dict[item]
  ...    print "Found"
  ... except KeyError:
  ...    print "Not found"

I can't imagine a custom implementation would be more efficient, though there may be other reasons to roll your own.

[+] sbov|13 years ago|reply

The problem is your solution (as shown) will grow in memory forever.

You can add code to get around that (e.g. expiration times), but it also might be easier (and have less memory usage) to write your own dict class that overwrites values when a collision occurs.

Also, since the OP seems to be using java: if you can get away with storing a primitive data type in an array (e.g. array of int or long) you can save a lot of memory vs using a Set or Map.

[+] philsnow|13 years ago|reply

Python dicts will grow and grow as you add more keys to them, but one of the constraints mentioned in the article (and implicitly included, because he's talking about a variation on a bloom filter) is that the data structure should not grow without bound. They wanted to use a fixed-size array.

Also, in your example you're storing True as the value, but if you do that, you can't tell the difference between rbloom_array[hashfunc(item1)] and rbloom_array[hashfunc(item2)], if hashfunc(item1) == hashfunc(item2) (that is, they collide). You'd need to store the actual item in the data structure, not just a bit, and then you compare what you get out of the data structure to what you're trying to put in it. If the _items_ are equal then it's a duplicate and you can skip the write (to the database, from the example). If the items are not equal, you need to do the write.

[+] unknown|13 years ago|reply

[deleted]

[+] unknown|13 years ago|reply

[deleted]

[+] unknown|13 years ago|reply

[deleted]

[+] davidkellis|13 years ago|reply

Couldn't you build a composite data structure that internally makes use of both a bloom filter and one of these "opposite of a bloom filter" structures, and get the best of both worlds - no false positives and no false negatives?

[+] EricBurnett|13 years ago|reply

Sadly not - instead you'd get disagreement from the two structures, with no way to resolve it.

If the bloom filter says 'yes I've seen it!', and the opposite filter says 'nope, not here', which is accurate? You know it's either a false-positive from the bloom filter or it a false-negative from the opposite filter, but that's the limit of your information. Effectively, you've partitioned your set into three: [definitely not, ambiguous, definitely seen].

If you want to know 100% of the items you've stored with full fidelity, you need to store information for all of them (e.g. in a hash table). Which is a valid thing to do, but requires a lot more memory than the author of this article was willing to use.

[+] gwillen|13 years ago|reply

There's no free lunch; anything that can track a set perfectly is going to end up taking as much space as just maintaining it in a more straightforward way.

In the case of your specific suggestion, though, the problem is this: what if the bloom filter says "I've seen it (but could be lying)", and the "opposite" filter says "I've never seen it (but could be lying)"? Then you still have no idea whether the object is in the set.

[+] anonymoushn|13 years ago|reply

This is an appealing idea and it took me a moment to see why it wouldn't work (other than the part where you simply don't have enough information).

It's correct to say that the bloom filter will produce no false negatives and the OOABF will produce no false positives, so if your OOABF returns true or your bloom filter returns false then you know for sure whether you've seen the item before. The trouble happens when your OOABF returns false and your bloom filter returns true, in which case you might have seen it before.

[+] MichaelGG|13 years ago|reply

The thing is, when the bloom filter returns "I've seen it", you still need to resort to the array that has the actual keys.

A good way to look at the problem is that you have N keys to store (because you cannot have false positives). So with N keys, if you want less than sizeof(N)*N bytes, the only real answer is compression. So either you just discard data (what the OP does), or find a clever way to compress things.

The benefit is that you're only testing for existence, so that gives you some leeway. For instance, suppose you are getting random 8-byte user IDs. You could store, say, 1M of them in a hashtable, then take those items, sort them, and store deltas. Instead of 8 bytes per item, you'd only need 44 bits on average (2^64/1M) to store the differences. I believe this is what the OP suggested at the end of the article.

So, it really depends on the key types, and the penalty for forgetting an item. At a really high penalty (say, something that needs to perform an ACID SQL transaction), maintaining a compressed block of IDs is rather attractive. Whereas, if it's just sending an extra write to a high-perf write store, maybe it doesn't matter, and the OPs quick forgetful array is useful.

[+] afc|13 years ago|reply

Um, wouldn't just a cache of the last N (based on memory available) items (or their hashes, whatever) just do the job, and do it better than his solution (which removes elements randomly based on collisions, rather than just removing the oldest element in the cache when it gets full, a much better approach given the time locality he describes)?

The description in this article seems to add too much complexity ("the opposite of a bloom filter") for what really is just a simple cache (albeit, IMO, a relatively inefficient one for the problem he describes).

[+] cnlwsu|13 years ago|reply

I had pretty much exact problem. We keep index of row keys in Cassandra on time partitioned data (by hour), but do not want to re-insert the index each time there is a piece of data.

A ConcurrentLinkedHashMap (http://code.google.com/p/concurrentlinkedhashmap/) solved it easily. It does use more RAM though... but an insignificant amount on our heap.

[+] sprsquish|13 years ago|reply

This is almost exactly what our use case is. I slightly altered the implementation to be more GC friendly though: http://squishtech.posterous.com/addendum-to-the-opposite-of-...

[+] scotty79|13 years ago|reply

I'd implement it exactly like bloom filter but instead bits I'd use 0,1,2 0 means never touched. 1 means I've seen it, 2 means I've seen it more than once.

True positive is when all places for given entry contain 1.

Adding is incrementing all places (if they have value less than 2).

Before putting new value in check if it's not already here. If it is don't add.

[+] eridius|13 years ago|reply

That's not going to work at all. If I have an object A that hashes to bits 1, 5, 17, 24; an object B that hashes to bits 1, 5, 13, 15; and an object C that hashes to bits 2, 9, 17, 24; and I add objects B and C, then the filter will incorrectly think A has already been seen.

[+] bkirwi|13 years ago|reply

This doesn't work. If an object hashes to places abc, and each of a, b, and c were already incremented independantly by previous insertions, you would get a false positive.

[+] mtrimpe|13 years ago|reply

Another option, in case you know the bounds of the parameter space (eg all natural numbers under 1000) you can implement this with a counting bloom filter that has every possible element pre-inserted.

[+] webreac|13 years ago|reply

How to give fancy names to a simple algorithm using a hash map. I use perl a lot and as a result, this kind of algorithm occurs always.

[+] webreac|13 years ago|reply

In fact, it is simpler than a hash. The provided code is not complete (the limit to 1 hour is missing). Here is a full functional solution to its problem:

  perl -MDigest::MD5 -nle '($id,@data)=split;$t=time;$j=unpack("L", Digest::MD5::md5($id))%100;print join(" ",$id,@data) if $t-$h{$j}>3600 or $hi{$j} ne $id;$h{$j}=$t;$hi{$j}=$id' input

[+] biot|13 years ago|reply

False positives are possible but rare due to hash collisions. With a bloom filter, false negatives are impossible.

[+] mason55|13 years ago|reply

False positives are possible but rare due to hash collisions

When there is a collision you check to see if the object living there is the one you're trying to store. If it's not then you assume negative. If it is the object you're checking then it's positive but it's a correct positive.

If it's not the object you're checking then you evict the object living there and put in the one you're checking, thus allowing the one you evicted to cause a false negative again.

The whole system works BECAUSE of intentional collisions. If there were no collisions it would just be a hashmap and you'd have unbounded data structure size.

57 comments