item 39885612


mobicham | 1 year ago

Yes, correct, and that fetching operation is non-blocking; once we dequantize the weights, we discard them before moving on to the next layer.
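As a rough illustration of that loop (a pure-NumPy sketch with hypothetical function names, not the actual implementation; real code would overlap a pinned-memory, non-blocking GPU copy of the next layer with the current matmul):

```python
import numpy as np

def dequantize(q, scale, zero):
    # Reconstruct approximate fp32 weights from low-bit integer codes.
    return (q.astype(np.float32) - zero) * scale

def stream_forward(x, quantized_layers):
    """Forward pass that holds only one dequantized layer at a time.

    In the real setting the fetch of layer i+1 would overlap with the
    matmul of layer i; here we just show dequantize -> use -> discard.
    """
    for q, scale, zero in quantized_layers:
        w = dequantize(q, scale, zero)  # "fetch" + dequantize
        x = x @ w                       # use the temporary fp copy
        del w                           # discard before the next layer
    return x
```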

Technically, you could do it for the weights as well, but that wouldn't work in many situations. For example, when training with FSDP, the quantized weights stay on the device, but you can still offload the meta-data (https://www.answer.ai/posts/2024-03-06-fsdp-qlora.html).

I would like to reiterate that larger models, which would be more interesting to run at low bits, are much less sensitive to quantization than a 7B. So you could potentially use a larger group-size and just keep the meta-data on device, like what is done with 4-bit and 3-bit now using a group-size of 64. We just started running some experiments with a 13B Llama2 and it looks very good so far (outperforming some full-precision Llama2-13B-based models). Let's see how far we can push it; ideally, getting rid of the reshaping altogether would be great.
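For context on why group-size matters: group-wise quantization keeps one (scale, min) pair per group of weights, so a larger group amortizes the meta-data over more weights at the cost of a coarser fit. A minimal NumPy sketch (function names hypothetical, not the library's API):

```python
import numpy as np

def quantize_groupwise(w, bits=2, group_size=64):
    """Affine quantization with one (scale, min) pair per group.

    Larger group_size -> less meta-data per weight, coarser fit;
    larger models tolerate the coarser fit better.
    """
    g = w.reshape(-1, group_size)
    wmin = g.min(axis=1, keepdims=True)
    qmax = 2**bits - 1
    scale = (g.max(axis=1, keepdims=True) - wmin) / qmax
    scale[scale == 0] = 1.0  # guard against constant groups
    q = np.clip(np.round((g - wmin) / scale), 0, qmax).astype(np.uint8)
    return q, scale, wmin

def dequantize_groupwise(q, scale, wmin, shape):
    return (q.astype(np.float32) * scale + wmin).reshape(shape)
```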


vladf | 1 year ago

If you’re willing to pay the latency cost of per-layer CPU fetching/offloading, I don’t see what extreme quant buys you.

You could just do a layer-by-layer fetching scheme with 4 bit weights.

For training too, just fetch each layer twice per step, as needed for the fwd/bwd passes.

And all for an HBM cost equal to one layer’s worth of weights.
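Back-of-the-envelope numbers for that trade-off, under assumed Llama-2-7B-like sizes (7B params, 32 layers, counting only the weight codes):

```python
# Hypothetical back-of-the-envelope numbers, not measurements.
n_params = 7e9   # Llama-2-7B-like parameter count
n_layers = 32
bits = 4

total_bytes = n_params * bits / 8          # whole model at 4-bit
per_layer_bytes = total_bytes / n_layers   # resident HBM when streaming

print(f"full model at {bits}-bit: {total_bytes / 1e9:.1f} GB")
print(f"one layer resident: {per_layer_bytes / 1e6:.0f} MB")
```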

mobicham | 1 year ago

The extreme quant buys you potentially 70x more efficient matmul via binary/ternary operations.
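The source of that potential speedup can be sketched in pure Python: with weights and activations restricted to ±1, a dot product reduces to XOR plus popcount, with no multiplies at all. A toy example (the packing scheme here is illustrative, not any particular kernel's):

```python
def pack_signs(v):
    """Pack a +/-1 vector into an int: bit i is set where v[i] == -1."""
    word = 0
    for i, x in enumerate(v):
        if x == -1:
            word |= 1 << i
    return word

def binary_dot(a_bits, b_bits, n):
    """Dot product of two packed +/-1 vectors of length n.

    Equal signs contribute +1, differing signs -1, so the dot product
    is n - 2 * popcount(a XOR b): bitwise ops only.
    """
    return n - 2 * bin(a_bits ^ b_bits).count("1")
```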

You still have a group-size of 64 in 4-bit, FYI. And even if you keep the meta-data on-device, provided that the quality is high (which is the case for 2-bit, outperforming fp16 on certain tasks), that is a much better option than 4-bit, even if the VRAM usage is the same.
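The VRAM comparison follows from the effective bits per weight: the low-bit codes plus the amortized per-group meta-data. A small sketch, assuming an fp16 scale and fp16 zero-point per group (32 meta-data bits per group):

```python
def effective_bits(bits, group_size, meta_bits=32):
    """Average bits per weight: low-bit code plus amortized per-group
    meta-data (assumed: fp16 scale + fp16 zero-point = 32 bits/group)."""
    return bits + meta_bits / group_size

print(effective_bits(4, 64))  # 4.5 bits/weight
print(effective_bits(2, 64))  # 2.5 bits/weight
print(effective_bits(2, 16))  # 4.0 -- 2-bit with small groups matches 4-bit VRAM
```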

Again, and I keep repeating this but it seems to be ignored every time: this is experimental work and it's still in progress. This story of small group-sizes should not be an issue on large models.