The essence of the trick: make the sign bit the least significant bit. Then all the leading bytes for integers of small magnitude remain zeros (not 0xFF like in two's-complement integers), and compression becomes easy.
This is an amazing explanation, and immediately makes the formula `(val<<1) ^ (val>>31)` intuitive, compared to trying to interpret it as the mapping x -> 2x for x >= 0 and x -> 2|x| - 1 for x < 0.
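For reference, a minimal sketch of that mapping in C (function names are mine; the encode uses the article's formula, which assumes the common arithmetic right shift for signed values):

```c
#include <assert.h>
#include <stdint.h>

/* Zigzag encode: maps 0, -1, 1, -2, 2, ... to 0, 1, 2, 3, 4, ...
   The sign bit lands in the least significant position, so small
   magnitudes keep all their leading bytes zero. */
static uint32_t zigzag_encode(int32_t val) {
    /* cast before shifting left: left-shifting a negative int is UB */
    return ((uint32_t)val << 1) ^ (uint32_t)(val >> 31);
}

/* Decode reverses it: shift the magnitude back down, then flip all
   bits if the low (sign) bit was set. */
static int32_t zigzag_decode(uint32_t enc) {
    return (int32_t)((enc >> 1) ^ (0u - (enc & 1)));
}
```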
I used this trick for delta compression in an ArrayBuffer-backed data exchange format. Since it was getting sent from a web server, we got gzip for free, but could still shrink data sizes by filtering the data in such a way as to make it easier to compress. Delta-compressing correlated data points yields a lot of small but randomly distributed positive/negative values, so by zigzag-encoding them while still saving them as 32-bit values, you end up with a lot of runs of 0s, which compress very efficiently and only require a quick pass through the data to restore the original values.
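A sketch of that filter pass (my own names, not the commenter's actual format): delta-encode, then zigzag, so each small delta becomes a small unsigned value with three high bytes of zeros for gzip to chew on.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Forward pass: delta-encode correlated samples, then zigzag the
   small signed deltas into small unsigned values. The sign mask is
   computed with unsigned arithmetic to stay portable. */
static void delta_zigzag(const int32_t *in, uint32_t *out, size_t n) {
    int32_t prev = 0;
    for (size_t i = 0; i < n; i++) {
        int32_t d = in[i] - prev;                 /* small signed delta */
        uint32_t u = (uint32_t)d;
        out[i] = (u << 1) ^ (0u - (u >> 31));     /* zigzag */
        prev = in[i];
    }
}

/* The quick restoring pass: undo zigzag, then prefix-sum the deltas. */
static void undelta_unzigzag(const uint32_t *in, int32_t *out, size_t n) {
    int32_t prev = 0;
    for (size_t i = 0; i < n; i++) {
        int32_t d = (int32_t)((in[i] >> 1) ^ (0u - (in[i] & 1)));
        prev += d;
        out[i] = prev;
    }
}
```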
Came here to say the same thing. When I first saw the protobuf zigzag encoding, I saw the same pattern. At first, I thought I must be wrong because they aren't describing it this way. Nope, it really is that simple.
In earlier computer systems (pre-1980s), this kind of variety was more common. In my career, I have only been exposed to two's-complement or float formats. I would bet that this is true for many other engineers as well.
I'm glad the protobuf folks used it. I'm also glad Dr. Lemire wrote an article about it. It is another clever trick to have in the toolbox.
The `fast_decode` as written is undefined behavior for some inputs (INT_MAX for example) because of the integer overflow in `2*x`. This can be fixed by casting to an unsigned int first.
Also a warning, right shifting a negative number gives the expected result (arithmetic right shift) in C++20, but is still implementation defined even in the C23 draft spec. Likely you're good, but the spec doesn't guarantee it.
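Both pitfalls can be sidestepped by doing all the doubling and shifting on unsigned values, where overflow wraps and right shift is logical by definition. A sketch (not the article's code):

```c
#include <assert.h>
#include <stdint.h>

/* Encode without right-shifting a negative number: build the
   all-ones/all-zeros sign mask from the unsigned top bit instead. */
static uint32_t zz_encode(int32_t v) {
    uint32_t u = (uint32_t)v;       /* conversion is well-defined */
    uint32_t mask = 0u - (u >> 31); /* 0xFFFFFFFF iff v is negative */
    return (u << 1) ^ mask;         /* unsigned overflow just wraps */
}

/* Decode without the signed 2*x overflow: stay unsigned until the
   final conversion back (two's-complement on every real target,
   and guaranteed by C23). */
static int32_t zz_decode(uint32_t u) {
    return (int32_t)((u >> 1) ^ (0u - (u & 1)));
}
```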
It's pretty cool how a simple encoding scheme can have profound effects on compression. This is a big part of what makes DCT-style block compression work as well as it does. You rig the game such that all of your higher frequency components wind up at the end - ideally as zeroes after quantization. So, a simple RLE scheme can do a lot more work than it otherwise could.
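A toy sketch of that last step (not an actual codec): once quantization has turned the high-frequency tail into zeros, even the simplest zero-run RLE collapses it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy zero-run RLE: emit (0, run_length) for each run of zero bytes
   and pass other bytes through unchanged. Returns the output length. */
static size_t rle_zeros(const uint8_t *in, size_t n, uint8_t *out) {
    size_t o = 0;
    for (size_t i = 0; i < n; ) {
        if (in[i] == 0) {
            uint8_t run = 0;
            while (i < n && in[i] == 0 && run < 255) { i++; run++; }
            out[o++] = 0;
            out[o++] = run;
        } else {
            out[o++] = in[i++];
        }
    }
    return o;
}
```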
Reminds me of this video on using -2 as a base for numbers (it can represent all positive and negative integers without a sign bit): https://www.youtube.com/watch?v=GWA_NraxOUw
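A small sketch of that base -2 ("negabinary") idea: repeated division by -2 with the remainder normalized to {0, 1} yields a sign-free digit string for any integer.

```c
/* Convert an integer to base -2 digits, least-significant first.
   Every integer, positive or negative, gets a string of 0s and 1s
   with no sign bit. `digits` must have room for ~34 chars.
   Returns the number of digits written. */
static int to_negabinary(int n, char *digits) {
    int len = 0;
    if (n == 0) digits[len++] = '0';
    while (n != 0) {
        int r = n % -2;                 /* remainder in {-1, 0, 1} */
        n /= -2;
        if (r < 0) { r += 2; n += 1; }  /* normalize digit to {0, 1} */
        digits[len++] = (char)('0' + r);
    }
    digits[len] = '\0';
    return len;
}
```

For example, -1 comes out as the digits 1, 1 (LSB first), i.e. 1*1 + 1*(-2) = -1.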
Protobuf was the first major project I was aware of that uses zigzag encoding for integers. They then encode those integers using the typical 7 bit encoding scheme where the msb indicates there's another byte following. I'm sure they weren't the first by any means though.
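That 7-bit continuation scheme, sketched (names are mine, but the wire format matches the common varint): the low 7 bits of each byte carry payload and the MSB says "another byte follows", so small zigzagged values fit in a single byte.

```c
#include <assert.h>
#include <stdint.h>

/* Varint encode: 7 payload bits per byte, MSB = continuation flag.
   Returns the number of bytes written (1 to 5 for a uint32_t). */
static int varint_encode(uint32_t v, uint8_t *out) {
    int n = 0;
    while (v >= 0x80) {
        out[n++] = (uint8_t)(v & 0x7F) | 0x80;
        v >>= 7;
    }
    out[n++] = (uint8_t)v;
    return n;
}

/* Varint decode; stores the number of bytes consumed in *len. */
static uint32_t varint_decode(const uint8_t *in, int *len) {
    uint32_t v = 0;
    int shift = 0, n = 0;
    uint8_t b;
    do {
        b = in[n++];
        v |= (uint32_t)(b & 0x7F) << shift;
        shift += 7;
    } while (b & 0x80);
    *len = n;
    return v;
}
```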
I'm currently using this in a binary format for serializing dynamic JSON-like values that I invented and am implementing in Rust. I will release it as open source sometime next year.
They use a variant of that in most modern SAT solvers for the literal encoding, by the way, i.e. to encode the negation of variables, which traditionally are represented with negative integers. Mostly, this is because they then use the encoded literals as indices into an array. Historically, this also had performance reasons, if I remember correctly. I feel it has no benefit anymore, and people simply got used to it (and never touch that part of the code anyway). But I might be wrong. I have not yet benchmarked any alternative, but it is on my todo list (with low priority).
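A sketch of that literal scheme (function names are mine): DIMACS-style literals use negative integers for negated variables, and solvers typically remap literal l to 2*var + (negated ? 1 : 0), so a literal indexes arrays directly and negation is a single xor.

```c
#include <assert.h>

/* Map a DIMACS literal (e.g. 3 or -3) to the zigzag-style index
   2*var + sign, usable directly as an array index. */
static int lit_encode(int dimacs_lit) {
    int var = dimacs_lit < 0 ? -dimacs_lit : dimacs_lit;
    return 2 * var + (dimacs_lit < 0);
}

/* Negation just flips the sign bit sitting in the LSB. */
static int lit_negate(int lit) {
    return lit ^ 1;
}
```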
A nice bijection from integers to natural numbers. Mappings exist for rationals -> natural numbers and other countable sets, but may be less practical.
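One such mapping, sketched: the Cantor pairing function is a bijection N x N -> N, and composed with zigzag (Z -> N) it squeezes a rational p/q into a single natural number.

```c
#include <assert.h>
#include <stdint.h>

/* Cantor pairing: enumerate N x N along anti-diagonals.
   pi(x, y) = (x + y)(x + y + 1)/2 + y is a bijection onto N. */
static uint64_t cantor_pair(uint64_t x, uint64_t y) {
    return (x + y) * (x + y + 1) / 2 + y;
}
```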
It's cool that just the name is enough to tell you what the encoding is. He didn't mention it in the post, but it can be visualized as starting from 0, then zigzagging right and left to 1, -1, 2, -2, ... and labelling each point with consecutive natural numbers.
nine_k | 3 years ago
krackers | 3 years ago
hackcasual | 3 years ago
drudru | 3 years ago
10000truths | 3 years ago
raphlinus | 3 years ago
bob1029 | 3 years ago
benj111 | 3 years ago
I've just read up on DCT and don't really have a clue, but it doesn't seem applicable?
That being so, I'm not sure how RLE at the bit level would help. Surely encoding run lengths of 5-ish bits isn't going to compress much of anything.
IanWard1 | 3 years ago
eloff | 3 years ago
anonymoushn | 3 years ago
So it would be sweet to instead use a format for which it is possible to write fast decoders.
tikhonj | 3 years ago
unknown | 3 years ago
[deleted]
ramranch | 3 years ago
nynx | 3 years ago
phoboslab | 3 years ago
[1] https://github.com/LordJZ/libflac/blob/master/src/libFLAC/bi...
qsort | 3 years ago
https://en.wikipedia.org/wiki/Pairing_function
squaredot | 3 years ago
k0k0r0 | 3 years ago
Ferrotin | 3 years ago
deleterofworlds | 3 years ago
fluoridation | 3 years ago
DontchaKnowit | 3 years ago
bo1024 | 3 years ago
pwg | 3 years ago
zaptheimpaler | 3 years ago
shoo | 3 years ago
bugfix-66 | 3 years ago
https://bugfix-66.com/2c1df73cab89ec76d6fa10caf8a27c1fbe4d16...
and the decoder:
https://bugfix-66.com/1efa93a5eb0cc12b3de7cd1dab8e471a2cc95e...
The common varint, which you see in applications everywhere (e.g., git), is just a base-128 version of the above general scheme!
But base-32 or base-8 or base-64 varints can be a big win for some purposes. Remove the zig-zag encoding if negative integers don't occur.
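A sketch of why a smaller base can win (my own illustration, not the linked code): with `bits` payload bits plus one continuation bit per group, total cost is groups * (bits + 1) bits, and for some value distributions a narrower group beats base-128.

```c
#include <assert.h>
#include <stdint.h>

/* Count how many (bits + 1)-bit groups a generalized varint needs
   for value v: bits = 7 is the common byte-oriented varint,
   bits = 4 a base-16 variant, and so on. */
static int varint_groups(uint64_t v, int bits) {
    int groups = 1;
    while (v >> bits) {
        v >>= bits;
        groups++;
    }
    return groups;   /* total cost = groups * (bits + 1) bits */
}
```

For example, 300 costs 2 groups of 8 bits in base-128 (16 bits) but 3 groups of 5 bits in base-16 (15 bits).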