mobicham|1 year ago
Yes, correct, and that fetch is a non-blocking operation: once we dequantize a layer's weights, we discard them before moving on to the next layer.
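For the curious, here's a minimal sketch of that scheme (not our actual implementation): it assumes the quantized weights and their metadata live in pinned CPU memory, and dequantize() is a stand-in affine dequantizer. Real code would also need stream-safety care (e.g. Tensor.record_stream) that is elided here.

```python
import torch

copy_stream = torch.cuda.Stream()  # side stream for host-to-device copies

def fetch(layer):
    # Non-blocking copy of one layer's quantized weights + metadata.
    with torch.cuda.stream(copy_stream):
        return tuple(layer[k].to("cuda", non_blocking=True)
                     for k in ("w_q", "scale", "zero"))

def dequantize(w_q, scale, zero):
    # Stand-in affine dequantization: w ~= w_q * scale + zero.
    return w_q.float() * scale + zero

def forward_layers(x, layers):
    prefetched = fetch(layers[0])
    for i in range(len(layers)):
        # Wait until this layer's copy has landed on the device.
        torch.cuda.current_stream().wait_stream(copy_stream)
        w_q, scale, zero = prefetched
        # Kick off the next layer's copy while we compute this one.
        if i + 1 < len(layers):
            prefetched = fetch(layers[i + 1])
        w = dequantize(w_q, scale, zero)  # dequantize just-in-time
        x = x @ w.T
        del w, w_q, scale, zero           # discard before the next layer
    return x
```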
Technically, you can do it for the weights as well, but that wouldn't work in many situations. For example, when training with FSDP, the quantized weights stay on the device but you can still offload the metadata (https://www.answer.ai/posts/2024-03-06-fsdp-qlora.html).
I would like to reiterate that larger models, which are the more interesting ones to run at low bit-widths, are much less sensitive to quantization than a 7B. So you could potentially use a larger group size and just keep the metadata on device, as is done now with 4-bit and 3-bit using a group size of 64. We just started running some experiments with a 13B Llama2 and it looks very good so far (outperforming some full-precision Llama2-13B-based models). Let's see how far we can push it; ideally, getting rid of the reshaping altogether would be great.
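To make the group-size trade-off concrete, here is a rough sketch of group-wise affine quantization (illustrative, not our exact kernel). Each group of group_size weights shares one scale and one zero point, so doubling the group size halves the metadata that has to stay on device:

```python
import torch

def quantize_groupwise(w, n_bits=3, group_size=64):
    # Assumes w.numel() is divisible by group_size.
    qmax = 2 ** n_bits - 1
    g = w.reshape(-1, group_size)              # one row per group
    w_min = g.min(dim=1, keepdim=True).values
    w_max = g.max(dim=1, keepdim=True).values
    scale = (w_max - w_min).clamp(min=1e-8) / qmax
    w_q = ((g - w_min) / scale).round().clamp(0, qmax).to(torch.uint8)
    return w_q, scale, w_min                   # metadata: two values per group

def dequantize_groupwise(w_q, scale, zero, shape):
    return (w_q.float() * scale + zero).reshape(shape)
```

At group_size=64 that's two extra values per 64 weights; a larger model that tolerates group_size=128 would cut that metadata in half.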
vladf|1 year ago
You could just do a layer-by-layer fetching scheme with 4-bit weights.
For training too: just fetch each layer twice per step, as needed for fwd/bwd.
And all for an HBM cost equal to one layer's worth of weights.
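A toy version of that schedule, with manual gradients for a stack of ReLU linear layers (quantization and optimizer details elided; assumes an MSE loss, compatible layer shapes, and pinned CPU weights; only one layer's weights sit in HBM at a time, activations aside):

```python
import torch

def fetch(w_cpu):
    # Host-to-device copy of one layer's weights.
    return w_cpu.to("cuda", non_blocking=True)

def train_step(x, target, layers_cpu, lr=1e-3):
    # Forward: fetch -> compute -> discard, one layer at a time.
    acts = [x]
    for w_cpu in layers_cpu:
        w = fetch(w_cpu)
        acts.append(torch.relu(acts[-1] @ w.T))
        del w
    grad = 2 * (acts[-1] - target)             # MSE gradient at the output
    # Backward: fetch each layer a second time, in reverse order.
    for w_cpu, a in zip(reversed(layers_cpu), reversed(acts[:-1])):
        w = fetch(w_cpu)
        grad_pre = grad * (a @ w.T > 0)        # ReLU derivative mask
        w_cpu -= lr * (grad_pre.T @ a).cpu()   # update the CPU master copy
        grad = grad_pre @ w                    # gradient w.r.t. the layer input
        del w
    return acts[-1]
```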