(no title)
mobicham | 1 year ago
Technically, you can do it for the weights as well, but that wouldn't work in many situations. For example, when training with FSDP, the quantized weights stay on the device, but you can still offload the meta-data (https://www.answer.ai/posts/2024-03-06-fsdp-qlora.html)
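To make the weights/meta-data split concrete, here is a minimal sketch of group-wise quantization in NumPy (my own illustration, not the HQQ or FSDP-QLoRA implementation): the packed low-bit values are one tensor, and the per-group scale/zero-point meta-data are separate tensors that could live on a different device.

```python
import numpy as np

def quantize_groupwise(w, n_bits=4, group_size=64):
    """Quantize a flat weight tensor in groups of `group_size`.
    Returns the quantized codes plus the per-group meta-data
    (scale, zero-point) that could be offloaded separately."""
    qmax = 2 ** n_bits - 1
    g = w.reshape(-1, group_size)
    w_min = g.min(axis=1, keepdims=True)
    w_max = g.max(axis=1, keepdims=True)
    scale = (w_max - w_min) / qmax
    scale[scale == 0] = 1.0  # guard against constant groups
    q = np.clip(np.round((g - w_min) / scale), 0, qmax).astype(np.uint8)
    return q, scale, w_min  # meta-data: scale + zero per group

def dequantize_groupwise(q, scale, zero, shape):
    # On-the-fly reconstruction: codes stay on device, meta-data is fetched
    return (q * scale + zero).reshape(shape)

w = np.random.randn(4096).astype(np.float32)
q, scale, zero = quantize_groupwise(w, n_bits=4, group_size=64)
w_hat = dequantize_groupwise(q, scale, zero, w.shape)
```

With min/max affine quantization, the per-element reconstruction error is bounded by half a scale step, which is why smaller groups (tighter min/max ranges) improve quality at the cost of more meta-data.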
I would like to reiterate that larger models, which would be more interesting to run at low bits, are much less sensitive to quantization than a 7B. So you could potentially use a larger group-size and just keep the meta-data on device, like what is done now with 4-bit and 3-bit using a group-size of 64. We just started running experiments with a 13B Llama2 and it looks very good so far (outperforming some full-precision Llama2-13B-based models). Let's see how far we can push it; ideally, getting rid of the reshaping altogether would be great.
vladf | 1 year ago
You could just do a layer-by-layer fetching scheme with 4 bit weights.
For training too, just fetch each layer twice per step as needed for fwd/bwd.
And all for hbm cost equal to one layer’s worth
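A minimal sketch of the layer-by-layer fetching idea above (hypothetical names; the "device copy" is simulated with a plain array copy): weights stay in host memory and are streamed in one layer at a time, so peak device usage is a single layer's worth.

```python
import numpy as np

class StreamedModel:
    """Stream one layer at a time from 'host' to 'device'.
    For training, the same fetch would simply run twice per step
    (once for forward, once for backward)."""

    def __init__(self, layers_host):
        self.layers_host = layers_host  # weight matrices kept on "host"
        self.peak_device_bytes = 0      # track simulated HBM usage

    def _fetch(self, i):
        # stand-in for a host->device transfer
        w = self.layers_host[i].copy()
        self.peak_device_bytes = max(self.peak_device_bytes, w.nbytes)
        return w

    def forward(self, x):
        for i in range(len(self.layers_host)):
            w = self._fetch(i)         # fetch layer i on demand
            x = np.maximum(x @ w, 0)   # use it (ReLU MLP layer here)
            del w                      # free before fetching the next one
        return x
```

In practice the fetch of layer i+1 would be overlapped with the compute of layer i (double-buffering), which bumps the peak to two layers' worth but hides the transfer latency.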
mobicham | 1 year ago
You still have a group-size of 64 in 4-bit, FYI. And even if you keep the meta-data on device, provided the quality is high (which is the case for 2-bit, outperforming fp16 on certain tasks), that is a much better option than 4-bit even if the VRAM usage is the same.
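The VRAM comparison is simple arithmetic. Assuming two fp16 meta values (scale + zero-point) per group, the effective bits per weight work out as follows; the group sizes below are illustrative, not quoted from the thread:

```python
def effective_bits(n_bits, group_size, meta_bits=16, n_meta=2):
    """Effective storage per weight: quantized bits plus the amortized
    cost of n_meta per-group meta values (scale + zero-point)."""
    return n_bits + n_meta * meta_bits / group_size

b4 = effective_bits(4, 64)  # 4 + 32/64 = 4.5 bits per weight
b2 = effective_bits(2, 16)  # 2 + 32/16 = 4.0 bits per weight
```

So 2-bit with a small group-size of 16 still stores fewer effective bits per weight than 4-bit with a group-size of 64, which is why keeping the meta-data on device can be acceptable if the 2-bit quality holds up.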
Again, and I keep repeating this but it seems to be ignored every time: this is experimental work and it's still in progress. Small group-sizes on large models should not be an issue.