(no title)
WhitneyLand|1 month ago
Back when 4K movies needed expensive hardware, no one claimed they could play 4K on a home system and only later mentioned they had actually scaled down the resolution to make it possible.
The degree of quality loss is rarely characterized. Which makes sense, because it’s not easy to fully quantify quality loss with a few simple benchmarks.
By the time a model is quantized to 4 bits, 2 bits, or whatever, does anyone really know how much they’ve gained versus just running a model that’s sized appropriately for their hardware but not lobotomized?
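For what it's worth, the raw round-trip error that quantization introduces is easy to measure, even if the downstream quality loss isn't. A minimal NumPy sketch of symmetric round-to-nearest quantization (the function name and per-tensor scheme are my own illustrative choices, not any particular runtime's):

```python
import numpy as np

def quantize_dequantize(w, bits):
    """Quantize a float tensor to `bits` bits and back (per-tensor, symmetric)."""
    qmax = 2 ** (bits - 1) - 1           # e.g. 7 for int4
    scale = np.abs(w).max() / qmax       # single scale for the whole tensor
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale

rng = np.random.default_rng(0)
w = rng.normal(size=(256, 256)).astype(np.float32)

for bits in (8, 4, 2):
    err = np.abs(w - quantize_dequantize(w, bits)).mean()
    print(f"int{bits}: mean abs reconstruction error {err:.4f}")
```

This only captures weight-reconstruction error, which is exactly the point: it says nothing about which downstream capabilities break.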
zozbot234|1 month ago
int4 quantization is the original release in this case; it hasn't been quantized after the fact. It's a bit of a nuisance on hardware that doesn't natively support the format (you may waste some fraction of memory throughput on padding, specifically on NPU hardware that can't do the unpacking on its own), but no one here is reducing quality to make the model fit.
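The unpacking nuisance is concrete: two int4 values share one byte, so hardware without native int4 support has to split and sign-extend on every load. An illustrative NumPy sketch of the pack/unpack step (my own toy code, not any real runtime's kernel):

```python
import numpy as np

def pack_int4(vals):
    """Pack signed int4 values (-8..7) pairwise into uint8 bytes."""
    u = (np.asarray(vals, dtype=np.int8) & 0x0F).astype(np.uint8)
    return (u[0::2] | (u[1::2] << 4)).astype(np.uint8)  # low nibble, high nibble

def unpack_int4(packed):
    """Recover signed int4 values from packed bytes."""
    lo = (packed & 0x0F).astype(np.int8)
    hi = (packed >> 4).astype(np.int8)
    lo = np.where(lo > 7, lo - 16, lo)   # sign-extend 4-bit two's complement
    hi = np.where(hi > 7, hi - 16, hi)
    out = np.empty(packed.size * 2, dtype=np.int8)
    out[0::2], out[1::2] = lo, hi
    return out

vals = np.array([-8, -1, 0, 7], dtype=np.int8)
assert (unpack_int4(pack_int4(vals)) == vals).all()
```

On hardware with native int4 paths this shuffle happens for free inside the load unit; without it, the extra masking and shifting eats into effective memory bandwidth.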
WhitneyLand|1 month ago
The broader point remains, though: “you can run this model at home…” when actually the caveats are potentially substantial.
It would be so incredibly slow…
FuckButtons|1 month ago
Any model that I can run in 128 GB at full precision is far inferior, for actually useful work, to the models I can just barely get to run after REAP + quantization.
I also read a paper a while back about improvements to model performance in contrastive learning when quantization was included during training as a form of perturbation, to force the model toward a smoother loss landscape. It made me wonder whether something similar might work for LLMs, which I think might be what the people over at MiniMax are doing with M2.1, since they released it in FP8.
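The quantization-during-training idea is usually implemented as a straight-through estimator: the forward pass sees quantized weights, the backward pass pretends quantization was the identity. Whether MiniMax actually trains this way is pure speculation on my part; here's a toy linear-regression sketch with made-up names:

```python
import numpy as np

def fake_quant(w, bits=4):
    """Round-trip quantize weights; used only in the forward pass."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax + 1e-12
    return np.round(w / scale).clip(-qmax - 1, qmax) * scale

rng = np.random.default_rng(0)
x = rng.normal(size=(512, 8))
true_w = rng.normal(size=8)
y = x @ true_w

w = np.zeros(8)                         # full-precision "master" weights
lr = 0.1
for _ in range(200):
    wq = fake_quant(w)                  # forward: quantized weights
    grad = x.T @ (x @ wq - y) / len(x)  # backward: treat d(wq)/dw as 1 (STE)
    w -= lr * grad                      # update the full-precision copy

print("final loss:", np.mean((x @ fake_quant(w) - y) ** 2))
```

The model learns to perform well *through* the quantizer, so release-time quantization is no longer a distribution shift it never saw during training.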
In principle, if the model has been effective during learning at separating and compressing concepts into approximately orthogonal subspaces (and assuming the white-box transformer architecture approximates what typical transformers do), quantization should really only impact outliers that were not well characterized during learning.
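The flip side of the outlier point is easy to demonstrate: with a shared per-tensor scale, a handful of large weights stretch the quantization grid and inflate the error on everything else, which is why outliers are the usual failure mode. Illustrative numbers only, not tied to any real model:

```python
import numpy as np

def quant_error(w, bits=4):
    """Mean abs error of per-tensor symmetric round-to-nearest quantization."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax
    w_hat = np.round(w / scale).clip(-qmax - 1, qmax) * scale
    return np.abs(w - w_hat).mean()

rng = np.random.default_rng(0)
w = rng.normal(size=10_000)             # near-Gaussian bulk, like trained weights
w_out = w.copy()
w_out[:10] = 50.0                       # inject a handful of large outliers

print("no outliers :", quant_error(w))
print("with outliers:", quant_error(w_out))
```

Per-channel scales or outlier-aware schemes exist precisely to keep a few extreme values from degrading the bulk.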
WhitneyLand|1 month ago
If this were the case, however, why would labs go to the trouble of distilling their smaller models rather than releasing quantized versions of the flagships?
codexon|1 month ago
https://arxiv.org/abs/2402.17764
jasonjmcghee|1 month ago
FWIW, not necessarily. I've noticed quantized models have strange and surprising failure modes, where everything seems to be working well and then the model death-spirals, repeating a specific word, or completely fails on one task out of a handful of similar tasks.
8-bit vs 4-bit can be almost imperceptible or night and day.
This isn't something you'd necessarily see just playing around, but it shows up when you're trying to do something specific.