vladf | 1 year ago
You could just do a layer-by-layer fetching scheme with 4-bit weights.
For training too, just fetch each layer twice per step, as needed for fwd/bwd.
And all for an HBM cost equal to one layer's worth.
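A minimal sketch of that scheme, assuming a simple synchronous fetch (names like `fetch_layer` are illustrative; a real system would do async HBM copies and 4-bit dequantization, simulated here with plain numpy copies):

```python
import numpy as np

rng = np.random.default_rng(0)
n_layers, d = 4, 8

# Weights live in "host" storage (stand-in for CPU RAM / disk holding 4-bit weights).
host_weights = [rng.standard_normal((d, d)).astype(np.float32) for _ in range(n_layers)]

def fetch_layer(i):
    # In a real system: an async HBM copy plus 4-bit dequantization.
    # Here it's just a copy, so peak "device" memory is one layer's worth.
    return host_weights[i].copy()

def forward(x):
    acts = [x]
    for i in range(n_layers):
        w = fetch_layer(i)          # layer i becomes resident
        x = np.maximum(x @ w, 0.0)  # fwd compute (ReLU MLP as a toy model)
        acts.append(x)
        del w                       # layer i evicted before fetching i+1
    return x, acts

def backward(acts, grad_out):
    # Each layer is fetched a second time for bwd, as described above.
    g = grad_out
    for i in reversed(range(n_layers)):
        w = fetch_layer(i)
        pre = acts[i] @ w           # recompute pre-activation for the ReLU mask
        g = (g * (pre > 0)) @ w.T
        del w
    return g

x = rng.standard_normal((2, d)).astype(np.float32)
out, acts = forward(x)
gin = backward(acts, np.ones_like(out))
print(out.shape, gin.shape)
```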
mobicham | 1 year ago
You still have a group-size of 64 in 4-bit, FYI. And even if you keep the metadata on-device, provided that the quality is high (which is the case for 2-bit, outperforming fp16 on certain tasks), that is a much better option than 4-bit even if the VRAM usage is the same.
Again, and I keep repeating this, but it seems to be ignored every time: this is experimental work and it's still in progress. Small group-sizes on large models should not be an issue.
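For context, a back-of-envelope on why the per-group metadata matters less than the headline bit-width (the `meta_bits=32` default is an assumption: an fp16 scale plus an fp16 zero point per quantization group):

```python
def bits_per_weight(wbits, group_size, meta_bits=32):
    """Effective storage per weight, counting per-group metadata.

    meta_bits is an assumption: e.g. one fp16 scale + one fp16 zero
    point per group (2 * 16 = 32 bits).
    """
    return wbits + meta_bits / group_size

print(bits_per_weight(4, 64))  # 4.5 bits/weight
print(bits_per_weight(2, 64))  # 2.5 bits/weight
```

So even with group-size 64, 2-bit stays well under the 4-bit footprint, and the gap only grows if the metadata is kept on-device in both cases.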
vladf | 1 year ago
Results may vary :)
> Again, and I keep repeating this, but it seems to be ignored every time: this is experimental work and it's still in progress. Small group-sizes on large models should not be an issue.
Apologies if something I said (or, I guess, did not say...) offended you! It's a hypothetical, and one that IME is not so easy to achieve, but maybe you have different results. That's why I didn't want to comment on it; maybe it's possible (but LLMs don't take to quantization as easily as other networks like image classifiers, in my experience).
> The extreme quant buys you potentially 70x more efficient matmul via binary/ternary operations.
To be clear, such hardware does not yet exist, and it's unclear whether you really can get a more efficient binary/ternary matmul if you need high-precision accumulators and more frequent broadcast shifts. It's again a complicated hardware question whether the total latency of doing many high-precision accumulations and many scales/shifts ends up smaller than a 4-bit baseline (or whether it's even feasible to implement, chip-area-wise).
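To make the accumulation point concrete, here is an illustrative sketch (names and layout are made up): a ternary matmul where every "multiply" collapses to an add or subtract, but all the real work lands in a wide int32 accumulator followed by a per-output rescale.

```python
import numpy as np

rng = np.random.default_rng(1)
m, k, n = 4, 64, 3

w_ternary = rng.integers(-1, 2, size=(k, n)).astype(np.int8)  # weights in {-1, 0, +1}
scale = rng.random(n).astype(np.float32)                       # per-column fp scale
x = rng.integers(-8, 8, size=(m, k)).astype(np.int8)           # integer activations

# "Multiplies" are sign flips / skips; the cost is int32 accumulation.
acc = np.zeros((m, n), dtype=np.int32)
for j in range(n):
    col = w_ternary[:, j]
    acc[:, j] = x[:, col == 1].sum(axis=1, dtype=np.int32) \
              - x[:, col == -1].sum(axis=1, dtype=np.int32)

out = acc.astype(np.float32) * scale  # the high-precision rescale step

# Sanity check against a plain matmul with dequantized weights.
ref = x.astype(np.float32) @ (w_ternary.astype(np.float32) * scale)
assert np.allclose(out, ref)
```

Whether hardware doing many such wide accumulations plus scales/shifts beats a 4-bit multiply-accumulate path is exactly the open question above.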