
Extending That XOR Trick to Billions of Rows

104 points | hundredwatt | 8 months ago | nochlin.com

22 comments

[+] dataflow|8 months ago|reply
Before you get too excited, this is a probabilistic algorithm, not a deterministic one. Feels weird to call it an "extension" when you lose an absolute guarantee, but still cool nonetheless.
[+] nyrikki|8 months ago|reply
> Finally, they introduce Invertible Bloom Filters, which add an exact get operation and a probabilistic listing operation.

I haven't spent time digging into the implementation details, but the exact get should allow for verification.

It is not uncommon to use probabilistic methods to reduce search space.

[+] hundredwatt|8 months ago|reply
You don't lose absolute guarantees, but the probabilistic nature means the process may fail (in a guaranteed detectable way) in which case you can try again with a larger parameter.

The "bloom filter" name is misleading in regard to this.

[+] nullc|8 months ago|reply
iblt has low space efficiency for small sets, small elements, and low failure rates (and on that note is only probabilistic in its success).

We implemented https://github.com/bitcoin-core/minisketch which has optimal size efficiency-- N bits of state will always correctly recover when there are N or fewer bits of set difference, even when the set elements are small (like 32 bits, for example).

So for example, you and I can each have sets of, say, ten thousand 32-bit elements which are identical except for 10 entries, and I can send you a 320-bit (32*10) sketch of my set, and from that you can always determine the 10 (or fewer) differences. The same element and difference size with IBLT would likely take thousands of bits to have a low failure rate.

The downside is that the minisketch approach has quadratic decode complexity in the size of the set difference, but this is not a big deal when the number of differences is small by construction or thanks to recursive subdivision.

For cases where the differences are large, IBLT eventually wins out-- the two ideas can also be hybridized in a variety of ways, e.g. using minisketch to make multi-element buckets in an IBLT, analogous to blocked bloom filters or the normal practice with cuckoo filters.

Another related scheme is cpisync which was used for many years by SKS key servers. It has communications efficiency like minisketch, but cubic decode costs.

[+] dzaima|8 months ago|reply
A rough sketch for a more direct way to extend the XOR trick to finding more than two differences:

For e.g. 3 differences: instead of a binary xor (i.e. a binary-digit-wise sum mod 2), do a binary-digit-wise sum mod 3 (negating one input); a 0 (mod 3) sum for a given bit means that bit is the same in all differing entries, while 1 (mod 3) or 2 (mod 3) means you can partition on that bit, producing partitions of sizes `sum` and `input_different_element_count - sum`; then repeat recursively until each partition contains just 1 difference. (Rounding the modulus up to the next power of two instead of using odd moduli for the summing is perfectly fine; the final infinite-precision sum is in the range [0; diffcount] anyway.)

Extends trivially to more than 3 differences, and collapses to the basic trick for 2 differences. The accumulator size is O(log(diffcount) * element_size), but the recursive partitioning takes O(n) space or O(diffcount * n) time (plus some logarithm something maybe). Tradeoffs are probably reasonably possible, but the basic hashset approach can reduce its O(n) space requirement at the cost of taking >O(n) time too by partitioning on a hash.
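A minimal recursive sketch of this idea in Python, under the assumption that the differences are one-sided (a `full` list vs. a `partial` list with some elements missing, all elements distinct). For clarity it keeps full-precision per-bit counts instead of the modular accumulators described above; `missing_by_bits` and its structure are invented for illustration, not taken from the article:

```python
def missing_by_bits(full, partial):
    """Recover the distinct elements of `full` absent from `partial` by
    recursively partitioning on a bit where the per-bit counts differ.
    Full-precision counts are used for clarity; sums mod the next power
    of two above the diff count would suffice, as noted above."""
    BITS = max(full, default=0).bit_length() or 1

    def bitcounts(xs):
        c = [0] * BITS
        for x in xs:
            for i in range(BITS):
                c[i] += (x >> i) & 1
        return c

    def rec(full, partial):
        k = len(full) - len(partial)  # number of missing elements here
        if k == 0:
            return []
        if k == 1:
            acc = 0  # basic XOR trick: matched pairs cancel
            for x in full + partial:
                acc ^= x
            return [acc]
        cf, cp = bitcounts(full), bitcounts(partial)
        for i in range(BITS):
            d = cf[i] - cp[i]   # missing elements with bit i set
            if 0 < d < k:       # bit i splits the missing elements
                mask = 1 << i
                return (rec([x for x in full if x & mask],
                            [x for x in partial if x & mask]) +
                        rec([x for x in full if not x & mask],
                            [x for x in partial if not x & mask]))
        raise ValueError("elements must be distinct")

    return rec(list(full), list(partial))
```

Since the missing elements are distinct, some bit always differs between two of them, so the partitioning step can never get stuck.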

[+] javcasas|8 months ago|reply
So the initial XOR trick is: use a hash to partition the data into batches so that each batch has up to 1 missing element.

Can't we use this again? I mean:

1. Partition the data so that some batches have up to 1 missing element.

2. Recover the elements where possible with the XOR trick.

3. Pick another hash function, then repeat finding more missing elements.

4. Repeat until no more missing elements.
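A hypothetical sketch of that loop in Python (the name `recover_missing`, the bucket count, and the round-salted hash are all made up for illustration; it assumes the diff is one-sided, i.e. `actual` is `expected` with some elements missing):

```python
def recover_missing(expected, actual, num_buckets=8, max_rounds=20):
    """Each round uses a fresh hash (salted with the round number) to
    partition both streams into buckets, tracking a count and an xor
    per bucket. A bucket whose net count is exactly 1 holds one missing
    element verbatim (all matched pairs cancel in the xor)."""
    found = set()
    for r in range(max_rounds):
        cnt = [0] * num_buckets
        acc = [0] * num_buckets
        for x in expected:
            if x in found:           # recovered in a prior round
                continue
            i = hash((r, x)) % num_buckets
            cnt[i] += 1
            acc[i] ^= x
        for x in actual:
            i = hash((r, x)) % num_buckets
            cnt[i] -= 1
            acc[i] ^= x
        if all(c == 0 for c in cnt):
            break                    # nothing left missing
        for i in range(num_buckets):
            if cnt[i] == 1:          # exactly one missing element here
                found.add(acc[i])
    return sorted(found)
```

If several missing elements collide in one bucket, that round simply recovers nothing from it and the next round's fresh hash likely separates them.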

[+] hundredwatt|8 months ago|reply
The graph constructed by using bloom filter-style hash functions supports a decoding process called "peeling" where you:

1. Find a batch with 1 missing element

2. Delete that element from its other assigned partitions

3. Repeat, as the modified batches may now be recoverable

This iterative process (surprisingly!) succeeds with very high probability as long as the number of partitions is at least 1.22x the number of missing elements, with k=3 hash functions.
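Roughly, in Python (a minimal IBLT-style sketch with count+xor cells; the cell count, k, seed, and tuple-hash are illustrative choices, not the article's implementation, and one-sided differences are assumed):

```python
def encode(elems, m, k, seed=0):
    """Build a table of m cells, each storing (count, xor-sum).
    Each element is hashed into k cells."""
    cells = [[0, 0] for _ in range(m)]
    for x in elems:
        for j in range(k):
            i = hash((seed, j, x)) % m
            cells[i][0] += 1
            cells[i][1] ^= x
    return cells

def subtract(a, b):
    """Cell-wise difference of two tables built with the same parameters;
    elements present in both sides cancel out."""
    return [[ca[0] - cb[0], ca[1] ^ cb[1]] for ca, cb in zip(a, b)]

def peel(cells, m, k, seed=0):
    """Decode by peeling: a cell with count == 1 holds one missing
    element verbatim; remove it from all its cells and repeat."""
    recovered = []
    progress = True
    while progress:
        progress = False
        for i in range(m):
            if cells[i][0] == 1:
                x = cells[i][1]
                recovered.append(x)
                for j in range(k):
                    t = hash((seed, j, x)) % m
                    cells[t][0] -= 1
                    cells[t][1] ^= x
                progress = True
    return recovered
```

Peeling can fail (no cell with count 1 remains), which is detectable: in that case you retry with a different seed or a larger table, as mentioned upthread.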

[+] dzaima|8 months ago|reply
A property of the initial XOR trick for 2 different elements is that it guarantees finding a way to partition in one pass (and with very trivial code; no hashing involved!), which is lost by replacing that with hashing. (The original trick does take two passes - finding the bit to partition on, then doing the actual partitioning - whereas hashing is 1+ε passes; but the first pass in the original is just an xor-fold, and the partitioning only needs to be `accumulator ^= (current_val & mask) ? current_val : 0` (the other partition is just the xor of the results of both passes), both of which can be trivially parallelized and SIMD'd with O(1) extra memory usage.)

The approach in my comment achieves guaranteeing finding partitions, and still avoids actual hashing or anything strictly-probabilistic, but does still lose the extreme triviality and mechanical sympathy of the original approach.

[+] dark-star|8 months ago|reply
It would be nice if they explained which XOR trick this is. It seems to have something to do with finding missing numbers in a list?
[+] foota|8 months ago|reply
It's a solution to the problem: given a list of n-1 distinct integers drawn from 1 through n, find the missing integer.

The trick is that when you xor all of the numbers in the list together and then xor that with the xor of 1 through n, the result is the missing number.
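For concreteness, a short Python version of that:

```python
from functools import reduce
from operator import xor

def find_missing(nums, n):
    """nums contains the integers 1..n with exactly one value missing.
    Since x ^ x == 0, xoring the list with the xor of 1..n cancels
    everything except the missing value."""
    return reduce(xor, nums, 0) ^ reduce(xor, range(1, n + 1), 0)
```

It runs in one pass with O(1) extra space, which is the whole appeal of the trick.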