top | item 33744993

Making all your integers positive with zigzag encoding

109 points| jjgreen | 3 years ago |lemire.me

40 comments

order
[+] nine_k|3 years ago|reply
The essence of the trick: make the sign bit the least significant bit. Then all the leading bytes for integers of small magnitude remain zeros (not 0xFF like in two-complement integers), and compression becomes easy.
[+] krackers|3 years ago|reply
This is an amazing explanation, and immediately makes the formula `(val<<1) ^ (val>>31)` intuitive, compared to trying to interpret it as mapping x -> 2|x| - 1.
[+] hackcasual|3 years ago|reply
I used this trick for delta compression for an ArrayBuffer backed data exchange format. Since it was getting sent from a web server, we got gzip for free, but could still shrink data sizes by filtering it in such a way as to make it easier to compress. Delta compressing correlated data points yields a lot of small, but randomly distributed positive/negative values, so by zig-zag encoding them but saving as 32bit values, you end up with a lot of runs of 0's, which get compressed very efficiently, and only require a quick pass through the data to restore the original values
[+] drudru|3 years ago|reply
Came here to say the same thing. When I first saw the protobuf zigzag encoding, I saw the same pattern. At first, I thought I must be wrong because they aren't describing it this way. Nope, it really is that simple.

In earlier computer systems (pre 1980s), this kind of variety was more common. In my career, I have only been exposed to twos-complement or float formats. I would bet that this is true for many other engineers as well.

I'm glad the protobuf folks used it. I'm also glad Dr. Lemire wrote an article about it. It is another clever trick to have in the toolbox.

[+] 10000truths|3 years ago|reply
In other words, a left rotation by 1 bit.
[+] raphlinus|3 years ago|reply
The `fast_decode` as written is undefined behavior for some inputs (INT_MAX for example) because of the integer overflow in `2*x`. This can be fixed by casting to an unsigned int first.

Also a warning, right shifting a negative number gives the expected result (arithmetic right shift) in C++20, but is still implementation defined even in the C23 draft spec. Likely you're good, but the spec doesn't guarantee it.

[+] bob1029|3 years ago|reply
It's pretty cool how a simple encoding scheme can have profound effects on compression. This is a big part of what makes DCT-style block compression work as well as it does. You rig the game such that all of your higher frequency components wind up at the end - ideally as zeroes after quantization. So, a simple RLE scheme can do a lot more work than it otherwise could.
[+] benj111|3 years ago|reply
Sorry. Are you inferring this is how DCT works or are you making a comparison?

Ive just read up about DCT and don't really have a clue, but it doesn't seems applicable?

That being so. I'm not sure how RLE at the bit level would help. Surely encoding run lengths of 5ish bits isn't going to compress much of anything.

[+] eloff|3 years ago|reply
Protobuf was the first major project I was aware of that uses zigzag encoding for integers. They then encode those integers using the typical 7 bit encoding scheme where the msb indicates there's another byte following. I'm sure they weren't the first by any means though.

I'm currently using this in a binary format for serializing dynamic JSON-like values that I invented and am implementing in Rust. I will release it as open source sometime next year.

[+] tikhonj|3 years ago|reply
I first ran into this encoding scheme in Avro, which you could also describe as a format for dynamic JSON-like values :)
[+] ramranch|3 years ago|reply
Does your protocol provide any advantages over bincode?
[+] k0k0r0|3 years ago|reply
They use a variant of that in most Modern SAT solvers for the literal encoding by the way, i.e. in order to encode negation of variables, which traditionally are represented with negative integers. Mostly, because they then use the encoded literal as indeces in an array. Historically, this also had performance reasons if I remember correctly. I feel it has no benefit anymore, and people simply got used to it (and never touch that part of the code anyway). But I might be wrong. I did not yet benchmarked any alternative, but it is on my todo-list (woth low priority).
[+] Ferrotin|3 years ago|reply
Wouldn’t they use a plain bitmask instead, the lsb indicating negation, with 3 being the negation of 2? That’s zigzag encoding with a negative zero.
[+] deleterofworlds|3 years ago|reply
A nice bijection from integers to natural numbers. Mappings exist for rationals -> natural numbers and other countable sets, but may be less practical.
[+] fluoridation|3 years ago|reply
Easy enough. If you have a rational class represented as a pair of irreducible int32_ts, you can simply do ((u64)numerator << 32) | (u64)denominator.
[+] DontchaKnowit|3 years ago|reply
Found this recently in propreitary code base for (in abstract terms) storing enormous numbers of price values. Very interesting
[+] bo1024|3 years ago|reply
I have to fill out a captcha just to read an article / blog post? (Edit: and enable cookies)
[+] pwg|3 years ago|reply
Or install Ublock Origin in Firefox blocking the javascript and read the entire article without a captcha getting in the way.
[+] zaptheimpaler|3 years ago|reply
it's cool just the name is enough to tell you what the encoding is. he didn't mention it in the post but it can be visualized as starting from 0, then zigzagging right and left to 1, -1, 2, -2 ... and labelling each point with consecutive natural numbers.
[+] shoo|3 years ago|reply
Can anyone suggest high-quality resources for learning about compression?
[+] bugfix-66|3 years ago|reply
A really tremendous varint/VLQ encoder (using a zig-zag encoding and an generalized base):

https://bugfix-66.com/2c1df73cab89ec76d6fa10caf8a27c1fbe4d16...

and the decoder:

https://bugfix-66.com/1efa93a5eb0cc12b3de7cd1dab8e471a2cc95e...

The common varint, which you see in applications everywhere (e.g., git), is just base-128 version of the above general scheme!

But base-32 or base-8 or base-64 varints can be a big win for some purposes. Remove the zig-zag encoding if negative integers don't occur.