areddyyt | 1 year ago
Generally, there are two types of quantization: post-training quantization (PTQ) and quantization-aware training (QAT).
PTQ almost always suffers some accuracy degradation. During training, the loss is measured with respect to the FP16/BF16 parameters, so the weights (and the resulting output distribution) are the ones that minimize the loss at full precision. Once a quantization function is applied, those weights shift, even if only by a tiny amount, so the model is no longer sitting at that minimum.
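A minimal sketch of that failure mode, assuming an AbsMean-style ternary scheme (the `thresh_ratio` value and the per-tensor scale `alpha` are illustrative choices, not any specific paper's recipe): the quantized weights land on only three levels, so they necessarily differ from the full-precision minimizer.

```python
import numpy as np

def ternary_quantize(w, thresh_ratio=0.7):
    """Snap weights to {-alpha, 0, +alpha} (illustrative AbsMean-style PTQ)."""
    delta = thresh_ratio * np.mean(np.abs(w))          # threshold for zeroing
    mask = np.abs(w) > delta
    alpha = np.mean(np.abs(w[mask])) if mask.any() else 0.0  # per-tensor scale
    return alpha * np.sign(w) * mask

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 4)).astype(np.float32)         # stand-in trained weights
w_q = ternary_quantize(w)

# w_q takes at most three distinct values, so it perturbs every weight away
# from the FP minimum; any loss measured against w changes under w_q.
print(np.unique(w_q))
print(np.max(np.abs(w - w_q)))
```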
We do QAT to get around this problem with PTQ. The weights are actually quantized during the forward pass of training, and the loss is measured with respect to the quantized weights. As a result, once the model converges, the ternary weights have converged as well, and the accuracy achieved at the end of training is the accuracy of the quantized model. At ~3B parameters, downstream task performance with ternary weights is identical to FP16.
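The QAT loop above can be sketched with a straight-through estimator (STE): quantize in the forward pass, but apply the gradient to the latent full-precision weights. This is a toy linear-regression version under assumed hyperparameters (`thresh_ratio`, learning rate), not any production recipe:

```python
import numpy as np

def ternary(w, thresh_ratio=0.7):
    """Illustrative ternary quantizer: {-alpha, 0, +alpha}."""
    delta = thresh_ratio * np.mean(np.abs(w)) + 1e-12
    mask = np.abs(w) > delta
    alpha = np.mean(np.abs(w[mask])) if mask.any() else 1.0
    return alpha * np.sign(w) * mask

rng = np.random.default_rng(1)
X = rng.normal(size=(256, 8))
w_true = ternary(rng.normal(size=8))       # a target realizable with ternary weights
y = X @ w_true

w = rng.normal(size=8) * 0.1               # latent full-precision weights
lr = 0.05
init_loss = np.mean((X @ ternary(w) - y) ** 2)
for _ in range(500):
    w_q = ternary(w)                       # quantize in the forward pass
    err = X @ w_q - y
    grad = X.T @ err / len(X)              # gradient w.r.t. the quantized weights
    w -= lr * grad                         # STE: apply it to the latent w instead
final_loss = np.mean((X @ ternary(w) - y) ** 2)
print(init_loss, final_loss)
```

Because the loss is always evaluated through the quantizer, the converged model *is* the quantized model; there is no separate post-hoc rounding step to knock it off the minimum.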