I'm not sure whether the number of parameters serves as a reliable measure of quality. I believe these models contain a lot of redundant computation and could be made much smaller without losing quality.

cubefox|2 years ago
The Chinchilla scaling law describes, alongside the optimal training data size, the optimal number of parameters for a given training compute budget. See https://dynomight.net/scaling/

sp332|2 years ago
For training, yes, but these models are optimized for inference, since inference will be run many more times than training. The original Llama models were trained on far more data than the Chinchilla-optimal amount.
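As a rough sketch of the trade-off being discussed: two common approximations from the Chinchilla work are that training FLOPs C ≈ 6·N·D (N parameters, D tokens) and that the compute-optimal ratio is roughly 20 tokens per parameter. The function name and the specific budget below are illustrative, not from the thread:

```python
import math

def chinchilla_optimal(compute_flops, tokens_per_param=20.0):
    """Split a training compute budget into parameters and tokens.

    Approximations: training FLOPs C ~= 6 * N * D, and
    compute-optimal D ~= tokens_per_param * N (Chinchilla found ~20).
    Solving C = 6 * N * (tokens_per_param * N) gives
    N = sqrt(C / (6 * tokens_per_param)).
    """
    n_params = math.sqrt(compute_flops / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# Chinchilla itself was ~70B params on ~1.4T tokens, a budget of
# roughly 6 * 7e10 * 1.4e12 ~= 5.9e23 FLOPs; the formula recovers
# approximately those numbers.
n, d = chinchilla_optimal(5.9e23)
print(f"params ~ {n:.2e}, tokens ~ {d:.2e}")
```

This is exactly the point sp332 makes: Llama-style models deliberately deviate from this ratio, training a smaller N on a much larger D so that the model is cheaper at inference time.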