jhj | 1 year ago

People in the HPC/classical supercomputing space have done this sort of thing for a while. There's a fair amount of literature on lossless floating point compression, such as Martin Burtscher's work or stuff out of LLNL (fpzip):

https://userweb.cs.txstate.edu/~burtscher/

https://computing.llnl.gov/projects/floating-point-compressi...

but it tends to be very application specific, where there tends to be high correlation / small deltas between neighboring values in a 2d/3d/4d/etc floating point array (e.g., you are compressing neighboring temperature grid points in a PDE weather simulation model; temperature differences in neighboring cells won't differ by that much).

In a lot of other cases (e.g., machine learning), the floating-point significand bits (and sometimes the sign bit) tend to be incompressible noise. The exponent is the only part that is really compressible, and the XOR trick does not help as much there because neighboring values can still vary a bit in exponent. An entropy encoder works well instead (it encodes closer to the actual underlying data distribution/entropy), and it also doesn't depend on neighboring floats having similar exponents.
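A rough way to see this is to estimate the per-field entropy of some synthetic data. The Gaussian array below is just a stand-in for "ML-like" values, and the numbers in the comments are approximate, not from dietgpu:

```python
import numpy as np

def entropy_bits(symbols):
    # Shannon entropy (bits/symbol) of a discrete symbol stream.
    _, counts = np.unique(symbols, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

rng = np.random.default_rng(0)
x = rng.standard_normal(1 << 16).astype(np.float32)
bits = x.view(np.uint32)

sign     = (bits >> 31) & 0x1        # 1-bit sign
exponent = (bits >> 23) & 0xFF       # 8-bit biased exponent
mantissa =  bits        & 0x7FFFFF   # 23-bit mantissa

print(entropy_bits(sign))            # ~1 bit: random signs, nothing to gain
print(entropy_bits(exponent))        # ~3 bits out of 8: very compressible
print(entropy_bits(mantissa & 0xFF)) # ~8 bits: the low mantissa byte is noise
```

An entropy coder like ANS gets the exponent stream close to its ~3-bit entropy, while the mantissa bits are best passed through untouched.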

In 2022, I created dietgpu, a library to losslessly compress/decompress floating point data at up to 400 GB/s on an A100. It uses a general-purpose asymmetric numeral system encoder/decoder on GPU (the first such implementation of general ANS on GPU, predating nvCOMP) for exponent compression.

We have used this to losslessly compress floating-point data sent between GPUs (e.g., over InfiniBand/NVLink/Ethernet) when training massive ML models, to speed up the overall wall-clock time of training across 100s/1000s of GPUs without changing anything about how the training works (it's lossless compression, so it computes the same thing it did before).

https://github.com/facebookresearch/dietgpu

dekhn|1 year ago

I've talked to several world-class computer scientists/bioinformaticians who realized, on their own, that decompression is the fastest way to fill the cache with data (because the amount of data streaming in is smaller, and CPUs are VERY fast at decompression).

One of the key innovations in the AMBER MD engine that made it work OK on cheaper systems was lossless floating point compression. It still impresses me that you can compress floats, send them over MPI, and decompress them, all faster/lower latency than the transport can send the uncompressed data.
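The arithmetic behind that is simple. A hypothetical back-of-envelope sketch (the link and codec throughputs below are made up for illustration; they are not AMBER's or dietgpu's numbers):

```python
# When does compress -> send -> decompress beat sending raw?
# Time = wire time for the (smaller) payload + codec time, vs. raw wire time.

def transfer_time(size_gb, link_gbps, ratio=1.0, codec_gbps=None):
    t = (size_gb * ratio) / link_gbps      # wire time for the payload
    if codec_gbps is not None:
        t += 2 * size_gb / codec_gbps      # compress + decompress, unoverlapped
    return t

raw  = transfer_time(10, link_gbps=25)                           # 25 GB/s link
comp = transfer_time(10, link_gbps=25, ratio=0.7, codec_gbps=400)
print(raw, comp)  # the compressed path wins whenever the codec is much
                  # faster than the link and the ratio is under 1.0
```

With overlap (pipelining compression against the send), the codec cost largely disappears and the win is close to the compression ratio itself.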

jhj|1 year ago

Not just MPI over a network. We can compress floats, send them over NVLink or PCIe to another GPU in the same host, and decompress them there, and it can be faster than sending the raw data between GPUs. That's the premise behind dietgpu: it's cheap compression with a modest ratio (around 0.6-0.9x of original size), but it's extremely fast, at 100s of GB/s of throughput, with the idea that you're racing a transport that is similarly fast. (General floating-point data can be quite incompressible or highly compressible; it really just depends on what is being passed around.)

Interconnects are, in general, improving at a slower rate than CPU/GPU compute, and that gap can be exploited.

mandarax8|1 year ago

I'm drawing my entire VRAM's worth of Morton-sorted points. Do you think this compression technique lends itself well to streaming decompression, i.e. decompressing one warp's worth of points at a time right before drawing? Or do I need larger batches to make it viable?

dragontamer|1 year ago

I'm not exceptionally good with compression, but I did a few experiments.

I've noticed that array-of-structs form is harder to compress than struct-of-arrays form. This is relevant because a 32-bit floating-point number is really a struct containing 3 fields: a 1-bit sign, an 8-bit exponent, and a 23-bit mantissa.

--------

If you have 1000 floats, I would expect the 1000 sign bits, grouped together, to be more compressible (whatever compression algorithm you use, it can better pick up the pattern in the signs).

Then the 8000 bits of exponent would also be more compressible, because the exponents would likely follow their own pattern.

Finally, the 23 bits of mantissa per float would probably be difficult or impossible to compress. But extracting them out before running a compression algorithm gives the signs + exponents more opportunity for compression.

--------

So yeah, just think about your data. And favor structure-of-arrays form for better compression.
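A quick sketch of the split, run through a generic byte compressor (zlib here; the all-positive Gaussian sample data is illustrative):

```python
import zlib
import numpy as np

rng = np.random.default_rng(1)
x = np.abs(rng.standard_normal(1 << 15)).astype(np.float32)
bits = x.view(np.uint32)

# Peel the three fields of each float32 into their own streams.
streams = {
    "sign":     np.packbits(((bits >> 31) & 1).astype(np.uint8)).tobytes(),
    "exponent": ((bits >> 23) & 0xFF).astype(np.uint8).tobytes(),
    "mantissa": (bits & 0x7FFFFF).astype(np.uint32).tobytes(),  # noise-like
}
ratios = {name: len(zlib.compress(s)) / len(s) for name, s in streams.items()}
for name, r in ratios.items():
    print(name, round(r, 2))
# The sign and exponent streams shrink substantially; the mantissa
# stream compresses far less, because its low bits are noise.
```

Interleaved as raw float32 bytes, the same data compresses worse, since the noisy mantissa bytes sit between every pair of predictable sign/exponent bytes.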

dahart|1 year ago

Totally, SoAs are often easier to compress. I hadn’t thought about floats as being AoSs, and then unpacking the signs, exponents and mantissas of a batch of floats into separate arrays, that’s an interesting way to look at it. You could probably take that one step further with floats or ints, and treat an array of numbers as an array of structures of 32 or 64 fields where every field is 1 bit. So to compress a batch of 1000 32-bit numbers, take all bit 31s in a row, then all bit 30s, etc. Or another way to look at it is to stack the numbers in a column, transpose it, and read out the bits.
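That bit-plane transpose can be done compactly with numpy's pack/unpack helpers (a sketch; assumes a little-endian host):

```python
import numpy as np

def bitplane_transpose(values):
    # One row per value, one column per bit, MSB first.
    v = np.asarray(values, dtype=np.uint32)
    bytes_be = np.ascontiguousarray(v.view(np.uint8).reshape(-1, 4)[:, ::-1])
    planes = np.unpackbits(bytes_be, axis=1, bitorder='big')
    # After transposing, each output byte group holds one bit-plane:
    # all bit-31s first, then all bit-30s, and so on.
    return np.packbits(planes.T)

planes = bitplane_transpose(np.arange(8, dtype=np.uint32))
# High bit-planes of small values are all zero and now sit together,
# which is exactly what a downstream compressor wants to see.
```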

It’s worth noting that the algorithm in the article, and its cousin, delta encoding, both already take advantage of any cases where the sign & exponent bits don’t change with every sample, and/or where the mantissa ULPs are constant for a while at a time. It’d be interesting to compare the explicit SoA-of-floats compression idea to XOR or delta compression, but I think it’s also interesting, in a meta kind of way, that these two seemingly different ideas have similar outcomes.
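The XOR step is easy to see on a few nearby values (a minimal sketch of the Gorilla-style idea; the sample values are arbitrary):

```python
import numpy as np

# XOR each float's bit pattern with its predecessor's. Nearby values share
# the sign, exponent, and high mantissa bits, so the XOR result has long
# runs of leading zeros that a bit-level encoder can store compactly.
x = np.array([20.0, 20.5, 21.0, 20.75], dtype=np.float32)
bits = x.view(np.uint32)
xored = bits[1:] ^ bits[:-1]
for d in xored:
    print(f"{int(d):032b}")  # each line starts with many zeros
```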

gcr|1 year ago

One of my fun projects was compressing all community levels for SRB2Kart (based on the Doom II engine) into a single archive suitable for WebGL rendering of map previews.

I was able to get all 220MB of the default SRB2kart maps down into 12.9MB, and could even get down to 5MB with heavy texture compression. LZMA alone could only do 37MB, so I was pretty pumped.

One of the harder parts is compressing the vertex data and adjacency information (which linedefs connect which vertices). I never figured out a satisfying way of doing that, but this makes me want to try again...

throwiforgtnlzy|1 year ago

Waves across I-35 to Reality Labs.

Sounds like something Numba should include perhaps.

I retired from HPC around the era of iPython, Open MPI, Infiniband, OFED, GPFS/Panasas/Lustre, Rocks Cluster, and Condor/PBS/SGE.

tines|1 year ago

Wow, cool to see Dr. Burtscher's name here! I was a student in his undergraduate parallel computing course and was invited to work in his lab. Very smart guy.