bjterry | 1 year ago
> The single number that should summarize your expectations about any LLM is the number of total flops that went into its training.
One thing I've been curious about is whether a model that's trained well beyond the Chinchilla-optimal level of compute will suffer more from quantization. All of that information has to live somewhere within the weights, so it stands to reason that you may have to retain more bits per weight to preserve that performance benefit.
If so, it would also mean that a smaller model that's been "overtrained" but can't be quantized without quality loss isn't necessarily cheaper for inference than a larger model that isn't overtrained but can be aggressively quantized. I haven't seen anyone discuss this, but maybe there's a paper on it.
If you could characterize what level of overtraining leads to quality loss at different levels of quantization, you could possibly find a more optimal model size for overtraining. E.g. if you train with 10T tokens and see quality loss at 4-bit, and you train with 20T tokens and see quality loss at 6-bit, you can fit a curve to those data points to estimate the maximum number of tokens the model can train on with the current methodology.
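To make the idea concrete, here's a minimal sketch of that curve fit, using the two hypothetical data points above (10T tokens → quality loss below 4-bit, 20T tokens → quality loss below 6-bit). The log-linear form is purely an assumption for illustration, not an established scaling law:

```python
import math

# Hypothetical data points from the example above:
# (training tokens, minimum bit-width before quantization hurts quality)
data = [(10e12, 4), (20e12, 6)]

# Assume bits = a * log2(tokens) + b and solve through the two points.
(x1, y1), (x2, y2) = data
a = (y2 - y1) / (math.log2(x2) - math.log2(x1))
b = y1 - a * math.log2(x1)

def min_bits(tokens):
    """Estimated minimum bit-width that preserves quality at this token count."""
    return a * math.log2(tokens) + b

def max_tokens(bits):
    """Inverse: largest token budget still quantizable to `bits` without loss."""
    return 2 ** ((bits - b) / a)

print(min_bits(40e12))   # extrapolated bit-width needed at 40T tokens
print(max_tokens(8))     # token budget that would push you to 8-bit
```

Under this particular fit, every doubling of tokens costs two extra bits, so you could read off the token budget at which, say, 8-bit quantization stops being free. With more than two data points you'd fit by least squares instead of solving exactly.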