top | item 46963047

The mathematics of compression in database systems

61 points| agavra | 21 days ago |bitsxpages.com

16 comments

order

thesz|16 days ago

Arithmetic coding of a single bit preserves ordering of encoded bits, if CDF(1) > CDF(0). If byte's encoding process is going from higher bits to lower bits, arithmetic coding (even with dynamic model) will preserve ordering of individual bytes.

In the end, arithmetic coding preserves ordering of encoded strings. Thus, comparison operations can be performed on the compressed representation of strings (and big-endian representations of integers and even floating point values), without the need to decompress data until that decompressed strings are needed.

Another view: strings are compared by memcmp as if they are mantissas with the base 256. "hi!" is 'h'(1/256)+'i'(1/256)^2+'!'(1/256)^3+0(1/256)^4 and then there are zeroes to the infinity. Arithmetic encoding represents encoded strings as mantissas where base is 2. Range coding can utilize other bases such as 256.

codeismightier|16 days ago

Intriguing claim! At first I was skeptical, thinking that there would be an issue with leading zeros being discarded. However, with some piloting, I was able to use Claude and ChatGPT to construct a proof of your claim.

Sketch of the argument:

First, an arithmetic coder maps strings to non-overlapping subintervals of [0, 1) that respect lexicographic order.

Second, the process of emitting the final encoding preserves this. If enc(s) ∈ I(s), enc(t) ∈ I(t), and I(s) <= I(t), then enc(s) <= enc(t).

Finally, binary fractions compared left-to-right bitwise yield the same order as their numerical values — this is just memcmp.

Thus, we have a proof of your claim that arithmetic coding preserves lexicographic order! Nice result!

My mistake was in thinking that leading zeros are discarded -- it is tailing zeros that are discarded!

srean|17 days ago

Then there is this eternal conversation about whether on should encrypt and then compress or compress and then encrypt.

Encrypted data will not compress well because encryption needs to remove patterns and patterns are what one exploits for compression.

If you compress and then encrypt, yes you can leak information through the file sizes, but there isn't really a way out of this. Encryption and compression are fundamentally at odds with each other.

smilliken|17 days ago

Compress then encrypt is not an option because your encryption is broken if it can be compressed at all. Mathematically it's a near certainty that the compression would increase the file size when given an encrypted input.

rcxdude|16 days ago

The biggest risk if if you are compressing secrets alongside attacker-controlled data. Then there's a host of attacks on the secret that become possible.

UltraSane|16 days ago

Encrypt then compress is pointless.

ozgrakkurt|17 days ago

Really interesting.

I was trying to implement a compression algorithm selection heuristic in some file format code I am developing. I found this to be too hard for me to reason about so basically gave up on it.

Feels like this blog post is getting there but there could be a more detailed sets of equations that actually calculate this based on some other parameters.

Having the code completely flexible and doing a full load production test with desired parameters to find the best tuning is an option but is also very difficult.

Also read this previously which I find similar.

https://rocksdb.org/blog/2021/12/29/ribbon-filter.html

agavra|16 days ago

the compression algorithm you select for your data is quite dependent on the dataset you have. the equations in this blog post don't help you choose which compression to use, but rather "how much" and when to compress. I would be curious to formalize the math for different compression algorithms though... might be a good follow up post!

hinkley|15 days ago

Seems like it should be possible to compress the hell out of any column with an index. Problem is you can always drop an index on a running system.

esafak|17 days ago

Your SSL cert is invalid.

pixl97|16 days ago

Unless the user just updated it, the current cert has been valid for the last 12 days. Maybe you're getting MITM'ed or don't have the root in your trust store?