I'm not sure how far the author has gone, but they should check out Gorilla compression[1] (just the compression part, not the whole database). It works well for time-series data, and might be suitable here? Basically if your numbers don't deviate massively--think of a CPU metric that stays in the same place throughout the day, inside the bounds of 0-100--the compression is really effective.

ClickHouse supports Gorilla and some other encodings[2] that might also be of use.

[1]: https://www.vldb.org/pvldb/vol8/p1816-teller.pdf
[2]: https://altinity.com/blog/2019/7/new-encodings-to-improve-cl...

gopalv|3 years ago

> they should check out Gorilla compression

Gorilla is XOR compression, which works best for time series where the metric changes smoothly from one point to the next, because it just XORs each value against the previous one.
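To make the XOR step concrete, here is a minimal sketch in Python. It only shows the XOR-against-previous part and how few bits differ when values repeat or change slightly; it does not implement Gorilla's actual bit packing. The helper names (`float_to_bits`, `xor_deltas`, `meaningful_bits`) are made up for illustration.

```python
import struct


def float_to_bits(x: float) -> int:
    """Reinterpret a float64 as a 64-bit unsigned integer."""
    return struct.unpack(">Q", struct.pack(">d", x))[0]


def xor_deltas(values):
    """XOR each value's bit pattern with the previous one (first value vs. 0)."""
    prev = 0
    out = []
    for v in values:
        bits = float_to_bits(v)
        out.append(bits ^ prev)
        prev = bits
    return out


def meaningful_bits(x: int) -> int:
    """Width of the span between the leading and trailing zero bits of x."""
    if x == 0:
        return 0
    leading = 64 - x.bit_length()
    trailing = (x & -x).bit_length() - 1
    return 64 - leading - trailing


if __name__ == "__main__":
    series = [50.0, 50.0, 50.5, 51.0, 51.0]
    for value, delta in zip(series, xor_deltas(series)):
        print(f"{value:6.1f}  xor={delta:016x}  meaningful_bits={meaningful_bits(delta)}")
```

A repeated value XORs to zero, and a nearby value leaves only a narrow window of set bits, which is what an encoder can then store compactly.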

Floats should really not be thought of as byte streams; they are three bit fields in a single word. Splitting sign, exponent, and mantissa into three streams compresses way better than keeping them all together. At that point you are just dealing with "how to compress integers", which is a much simpler problem.
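A rough sketch of that split for IEEE 754 float64 (1 sign bit, 11 exponent bits, 52 mantissa bits), using only the standard library. `split_fields` and `join_fields` are hypothetical helper names, and how each integer stream is then compressed is left open.

```python
import struct

MANTISSA_MASK = (1 << 52) - 1


def split_fields(values):
    """Split float64 values into separate sign, exponent, and mantissa streams."""
    signs, exponents, mantissas = [], [], []
    for v in values:
        bits = struct.unpack(">Q", struct.pack(">d", v))[0]
        signs.append(bits >> 63)                # 1 bit
        exponents.append((bits >> 52) & 0x7FF)  # 11 bits, biased by 1023
        mantissas.append(bits & MANTISSA_MASK)  # 52 bits
    return signs, exponents, mantissas


def join_fields(signs, exponents, mantissas):
    """Inverse of split_fields: rebuild the float64 values."""
    out = []
    for s, e, m in zip(signs, exponents, mantissas):
        bits = (s << 63) | (e << 52) | m
        out.append(struct.unpack(">d", struct.pack(">Q", bits))[0])
    return out


if __name__ == "__main__":
    signs, exponents, mantissas = split_fields([50.0, 50.5, 51.0, 49.75])
    print(signs)      # all values positive, so the sign stream is constant
    print(exponents)  # values of similar magnitude share an exponent
    print([hex(m) for m in mantissas])
```

For a metric that stays in a narrow range, the sign and exponent streams are nearly constant, so most of the remaining entropy sits in the mantissa integers.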

I played with zstd and it compresses way better if you take 8 float64 values and shuffle the bits sideways. This is a trick that blosc popularized [1].

Adding a shuffle filter ahead of zlib or zstd worked way better for reducing the size of the data when dealing with float streams. This groups the bits in a similar fashion to splitting the floats into components, but is much simpler on the decode path with SIMD.
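As an illustration of the shuffle idea, here is a small pure-Python sketch that does a byte-level shuffle (byte 0 of every value, then byte 1 of every value, and so on) before handing the data to zlib. This is the simpler byte variant, not a bit-level shuffle and not blosc's SIMD implementation; `shuffle_bytes` and `unshuffle_bytes` are made-up names, and the actual gain depends on the data.

```python
import struct
import zlib


def shuffle_bytes(raw: bytes, width: int = 8) -> bytes:
    """Group byte i of every width-byte element together (a byte transpose)."""
    return b"".join(raw[i::width] for i in range(width))


def unshuffle_bytes(shuffled: bytes, width: int = 8) -> bytes:
    """Inverse of shuffle_bytes."""
    n = len(shuffled) // width
    return bytes(shuffled[(i % width) * n + i // width] for i in range(n * width))


if __name__ == "__main__":
    # A slowly drifting metric, stored as big-endian float64.
    values = [50.0 + 0.01 * i for i in range(10_000)]
    raw = struct.pack(f">{len(values)}d", *values)

    plain = zlib.compress(raw)
    shuffled = zlib.compress(shuffle_bytes(raw))
    print(f"raw={len(raw)}  zlib={len(plain)}  shuffle+zlib={len(shuffled)}")

    # Decoding is decompress, then unshuffle.
    assert unshuffle_bytes(zlib.decompress(shuffled)) == raw
```

The sign and exponent bytes end up in long, nearly constant runs, which is the part a general-purpose compressor handles well.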

[1] - https://www.slideshare.net/PyData/blosc-py-data-2014/17