top | item 38556427

mobicham | 2 years ago

Very excited to share our latest work on model quantization. No data calibration needed, extremely fast, and it works on both language and vision models!

Code: https://lnkd.in/dM_NgSCQ Models: https://lnkd.in/dyw4x6Ga

Why does it matter? Quantization significantly reduces GPU memory requirements but degrades the quality of the models. Having faster and more accurate quantization methods is extremely valuable for the ML community.

Approach: a sparsity-based error formulation between the original weights and their dequantized counterparts. We use a Half-Quadratic solver to derive a closed-form solution that is 100x faster than backprop via PyTorch's Autograd.
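To make the idea concrete, here is a minimal sketch of the alternating scheme the post describes, assuming the usual round-to-nearest quantizer and a per-tensor zero-point (the released code works per-group and with more careful parameterization; `hqq_like_quantize`, the lp power `p`, and the annealing factor `kappa` are illustrative choices, not the library's API):

```python
import numpy as np

def quantize(W, s, z, nbits=3):
    """Round-to-nearest quantization: W -> integer codes in [0, 2**nbits - 1]."""
    return np.clip(np.round(W / s + z), 0, 2**nbits - 1)

def dequantize(Wq, s, z):
    """Map integer codes back to approximate float weights."""
    return (Wq - z) * s

def shrink_lp(x, beta, p=0.7):
    """Generalized soft-thresholding: a proximal step for the ||.||_p^p term (p < 1)."""
    mag = np.abs(x) + 1e-8  # eps keeps mag ** (p - 1) finite at zero
    return np.sign(x) * np.maximum(mag - (p / beta) * mag ** (p - 1), 0.0)

def hqq_like_quantize(W, nbits=3, iters=20, beta=1.0, p=0.7, kappa=1.01):
    """Calibration-free quantization in the spirit of the post: alternate a
    sparse-error step (soft-thresholding) with a closed-form zero-point update.
    No input data is needed -- only the weights themselves."""
    qmax = 2**nbits - 1
    s = (W.max() - W.min()) / qmax            # fixed scale from the weight range
    z = -W.min() / s                          # initial zero-point
    for _ in range(iters):
        Wq = quantize(W, s, z, nbits)
        We = shrink_lp(W - dequantize(Wq, s, z), beta, p)  # sparse residual
        z = np.mean(Wq - (W - We) / s)        # closed-form zero-point update
        beta *= kappa                         # anneal the half-quadratic penalty
    return quantize(W, s, z, nbits), s, z
```

The sparse residual `We` absorbs the outlier weights, so the closed-form zero-point fit is driven by the well-behaved majority of the weights; each iteration is just elementwise ops and a mean, which is why no Autograd backprop is needed.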

Quantization speed: ~1 minute for Llama2-13B, ~4 minutes for Llama2-70B (over 50x faster than GPTQ).

Findings:
- Larger models quantized to 3/2-bit outperform smaller full-precision models with similar or lower memory requirements.
- Successful 2-bit quantization requires a smaller group-size (e.g., 32 or 16) and compression of both the zero-point and the scaling factor to keep memory usage low.

* Llama2-70B-2bit (~26GB) > Llama2-13B-16bit (~26GB)
* Llama2-13B-3bit (~7.5GB) > Llama2-7B-8bit (~7.5GB)
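As a rough sanity check on the memory figures above, here is a back-of-the-envelope estimator. It is a hypothetical helper (not from the released code), assuming an fp16 scale and zero-point per group, i.e. 4 bytes of metadata per group, and ignoring embeddings and other overhead:

```python
def quantized_size_gb(n_params, w_bits, group_size=32, meta_bytes=4):
    """Estimate quantized model size: packed weight bits plus per-group
    metadata (assumed fp16 scale + fp16 zero-point = 4 bytes per group)."""
    weights = n_params * w_bits / 8            # packed weight payload, bytes
    metadata = (n_params / group_size) * meta_bytes
    return (weights + metadata) / 1e9          # decimal GB

print(quantized_size_gb(70e9, 2))  # ~26 GB for a 2-bit 70B model at group-size 32
```

This also shows why the group metadata itself must be compressed at 2-bit: with group-size 32, the scale and zero-point account for roughly a third of the estimated 26 GB.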
