I just checked Hugging Face and the model download is about 25GB, but in a comment below someone mentioned it is an fp8-quantized model. I'm trying to understand how quantization affects the model (and RAM) size. Can someone please enlighten me?
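For rough intuition, weight-file size is approximately parameter count × bytes per parameter. A minimal sketch of the arithmetic, where the 12B parameter count is an illustrative assumption (not stated in the thread), substitute your model's real count:

```python
# Back-of-the-envelope model-size math. The 12B parameter count is an
# assumption for illustration only.
params = 12e9
bytes_per_param = {"fp32": 4, "bf16": 2, "fp8/int8": 1}

for fmt, nbytes in bytes_per_param.items():
    # Size in GB if every weight were stored in this format.
    print(f"{fmt}: {params * nbytes / 1e9:.0f} GB")
# fp32: 48 GB, bf16: 24 GB, fp8/int8: 12 GB
```

Under that assumption, a ~25GB download would line up with bf16 weights rather than an fp8 quant, which would be roughly half that size. RAM use at inference time follows the same arithmetic, plus activations and KV cache.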
Sure. The talk about 8-bit refers to quantization-aware training (QAT). It's pretty common in image models these days to reduce the impact of quantization on accuracy.
Typically this means you simulate an 8-bit forward pass to ensure the model is robust to quantization 'noise', while still using FP16/32 for the backward pass and weight updates for numerical stability.
It's just a way of optimizing the model in anticipation of future quantization: an 8-bit Nemo quant should behave much more like the full-fat bf16 model than it would if they hadn't used QAT.
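The "simulated 8-bit forward pass" can be sketched as fake quantization: quantize then immediately dequantize, so the forward pass sees the rounding error while the master weights stay full-precision. A minimal illustrative version in plain NumPy (real frameworks ship this as built-in QAT modules):

```python
import numpy as np

def fake_quant(x, num_bits=8):
    """Quantize to num_bits, then immediately dequantize.

    The forward pass sees the rounding error ("quantization noise"),
    but the underlying weights remain in full precision.
    """
    qmax = 2 ** (num_bits - 1) - 1            # 127 for signed 8-bit
    scale = np.abs(x).max() / qmax
    if scale == 0:
        return x                              # all-zero tensor: nothing to quantize
    q = np.clip(np.round(x / scale), -qmax - 1, qmax)  # integer grid
    return q * scale                          # back to float

w = np.array([0.013, -0.970, 0.420, 0.0051])
w_q = fake_quant(w)
# Rounding error is bounded by half a quantization step (scale / 2);
# training against w_q teaches the model to tolerate that noise.
```

In the backward pass, QAT frameworks typically apply a straight-through estimator: gradients flow through the round() as if it were the identity, and the FP16/32 master weights are updated as usual.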
frontierkodiak|1 year ago