jhj | 10 months ago
Typical entropy of bfloat16 values seen in weights (and activations) is about 10-12 bits (only 65-75% or so of the value range is used in practice). The sign and mantissa bits tend to be incompressible noise.
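As a rough sketch of how you could measure this yourself (assuming a PyTorch bfloat16 weight tensor; the field split and bincount approach here is just one way to do it):

    import torch

    def field_entropies(w: torch.Tensor):
        # reinterpret the bfloat16 bits as unsigned 16-bit integers
        bits = w.detach().to(torch.bfloat16).contiguous().view(torch.int16)
        bits = bits.flatten().to(torch.int64) & 0xFFFF
        sign     = bits >> 15           # 1 bit
        exponent = (bits >> 7) & 0xFF   # 8 bits
        mantissa = bits & 0x7F          # 7 bits

        def H(vals, n):                 # Shannon entropy in bits
            p = torch.bincount(vals, minlength=n).double()
            p = p[p > 0] / p.sum()
            return float(-(p * p.log2()).sum())

        return {"sign": H(sign, 2), "exponent": H(exponent, 256), "mantissa": H(mantissa, 128)}

    # e.g. field_entropies(model.lm_head.weight) -- the exponent field is where
    # nearly all of the compressible redundancy lives; sign and mantissa come out
    # close to their full 1 and 7 bits.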
This has been exploited several times before in both classical HPC and AI: lossless compression work from Martin Burtscher's lab (https://userweb.cs.txstate.edu/~burtscher/), fpzip from LLNL (https://computing.llnl.gov/projects/fpzip), and my library dietgpu from 2021 (https://github.com/facebookresearch/dietgpu). We used dietgpu to speed up training on a large GPU cluster by about 10% wall-clock time overall, by losslessly compressing all data prior to send and decompressing upon receive (e.g., gradients, weights restored from backup, etc.). Since the compression is lossless, the cluster still computes exactly the same thing as it did before.
Also, rANS is more efficient than Huffman coding and easier to implement on SIMD-like instruction sets. It would also reduce DFloat11's latency/throughput penalties (since we have to decompress before we do the arithmetic).
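For anyone curious, the core of rANS is just integer arithmetic on a single state word, which is why it vectorizes so much better than walking a Huffman tree bit by bit. A scalar sketch (static frequencies summing to 2^12, 16-bit renormalization; nothing SIMD here, just the shape of the encode/decode loops):

    def cumfreqs(freqs):
        cum = [0]
        for f in freqs:
            cum.append(cum[-1] + f)
        return cum

    def rans_encode(symbols, freqs, scale_bits=12):
        M, L = 1 << scale_bits, 1 << 16          # L = lower bound of the state interval
        assert sum(freqs) == M
        cum = cumfreqs(freqs)
        x, words = L, []
        for s in reversed(symbols):              # rANS encodes in reverse symbol order
            f = freqs[s]
            while x >= ((L >> scale_bits) << 16) * f:   # renormalize: flush 16 bits
                words.append(x & 0xFFFF)
                x >>= 16
            x = (x // f) * M + (x % f) + cum[s]
        return x, words                          # final state plus flushed words

    def rans_decode(x, words, n, freqs, scale_bits=12):
        M, L = 1 << scale_bits, 1 << 16
        cum = cumfreqs(freqs)
        slot2sym = [s for s, f in enumerate(freqs) for _ in range(f)]
        words, out = list(words), []
        for _ in range(n):
            slot = x & (M - 1)
            s = slot2sym[slot]
            out.append(s)
            x = freqs[s] * (x >> scale_bits) + slot - cum[s]
            while x < L and words:               # renormalize: pull 16 bits back in
                x = (x << 16) | words.pop()
        return out

    # round trip over a toy 256-symbol distribution (frequencies sum to 4096)
    freqs = [15] * 255 + [271]
    data = [0, 1, 255, 42, 42, 7]
    state, words = rans_encode(data, freqs)
    assert rans_decode(state, words, len(data), freqs) == data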
vessenes|10 months ago
As we know, quantization is a critical tool for local LLM runners; RAM is typically the gating factor. Are you aware of other, better lossless compression schemes for BF16 weights out there?
The reason I ask is that DFloat11 seems relatively easy to plug into existing quantization workflows, but you seem dismissive of the paper -- I presume that's a gap in my understanding, and I'd like to understand.
bjornsing|10 months ago
I doubt that very much. The thing is, inputs are multiplied by weights and summed in each neural network layer, and the output then becomes the input of the next layer, a cycle that can repeat a hundred times or more. By the time you reach the final output layer, that 10^6 factor has compounded once per layer, so after ~100 layers it has snowballed to something like (10^6)^100 = 10^600.
ironbound|10 months ago
https://arxiv.org/html/2412.19437v2#S3
refibrillator|10 months ago
With DFloat11, tokens/sec was higher only relative to running inference with some layers offloaded to the CPU.
Classic comp sci tradeoff between space and speed, no free lunch, etc.
Dylan16807|10 months ago
At least the cost to truncate and zero fill is small.
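Assuming this means dropping the low-order bits and filling them back with zeros on load (e.g. fp32 -> bf16 and back), it's literally just a 16-bit shift on the raw bits; a NumPy sketch:

    import numpy as np

    def fp32_to_bf16_bits(x: np.ndarray) -> np.ndarray:
        # truncate: keep sign, exponent, and the top 7 mantissa bits
        return (x.astype(np.float32).view(np.uint32) >> 16).astype(np.uint16)

    def bf16_bits_to_fp32(b: np.ndarray) -> np.ndarray:
        # zero fill: widen back to float32 without recovering the dropped bits
        return (b.astype(np.uint32) << 16).view(np.float32)

    x = np.array([0.1, 1.5, -3.14159], dtype=np.float32)
    print(bf16_bits_to_fp32(fp32_to_bf16_bits(x)))   # close to x, ~3 decimal digits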
brookst|10 months ago
Would it be more efficient to calculate some kind of per-model or per-layer mean, and then store only each weight's deviation from it, maybe in fp8 or smaller?
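If I'm reading the idea right, it would look something like this (a sketch, emulating the low-bit deviation with a uint8 code rather than a real fp8 type) -- though note that unlike DFloat11 this is lossy quantization, not lossless coding:

    import numpy as np

    def encode_layer(w: np.ndarray, bits: int = 8):
        mean, std = w.mean(), w.std()
        levels = 2 ** bits
        # quantize (w - mean) / std into uniform bins over +/- 4 sigma
        z = np.clip((w - mean) / (std + 1e-12), -4.0, 4.0)
        codes = np.round((z + 4.0) / 8.0 * (levels - 1)).astype(np.uint8)
        return mean, std, codes          # two scalars per layer + one small code per weight

    def decode_layer(mean, std, codes, bits: int = 8):
        levels = 2 ** bits
        z = codes.astype(np.float32) / (levels - 1) * 8.0 - 4.0
        return mean + z * std

    w = (np.random.randn(4096) * 0.02).astype(np.float32)
    mean, std, codes = encode_layer(w)
    print(np.abs(w - decode_layer(mean, std, codes)).max())   # nonzero: information is lost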