pellucide|1 year ago
>We made several new observations on scaling behavior during the development of Llama 3. For example, while the Chinchilla-optimal amount of training compute for an 8B parameter model corresponds to ~200B tokens, we found that model performance continues to improve even after the model is trained on two orders of magnitude more data. Both our 8B and 70B parameter models continued to improve log-linearly after we trained them on up to 15T tokens. Larger models can match the performance of these smaller models with less training compute, but smaller models are generally preferred because they are much more efficient during inference.
Can someone experienced please explain this? Does this mean that a lean model with more training time and/or more (or better) training data will perform better than a fat model?
modeless|1 year ago
"Chinchilla-optimal" is about choosing model size and/or dataset size to maximize the accuracy of your model under a fixed training budget (fixed number of floating point operations). For a given dataset size it will tell you the model size to use, and vice versa, again under the assumption of a fixed training budget.
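A minimal sketch of the commonly cited Chinchilla rules of thumb (approximations, not the full scaling-law fit): training compute is roughly 6 FLOPs per parameter per token, and the compute-optimal dataset size is roughly 20 tokens per parameter. The function names are my own; the ~160B-token result for an 8B model lines up with the "~200B tokens" figure in the quote above:

```python
def chinchilla_optimal_tokens(n_params: float) -> float:
    """Rule of thumb: compute-optimal training set is ~20 tokens per parameter."""
    return 20.0 * n_params

def training_flops(n_params: float, n_tokens: float) -> float:
    """Standard estimate: ~6 FLOPs per parameter per training token."""
    return 6.0 * n_params * n_tokens

n = 8e9  # an 8B-parameter model
d_opt = chinchilla_optimal_tokens(n)  # ~1.6e11 tokens, i.e. ~160B
print(f"optimal tokens:        {d_opt:.2e}")
print(f"compute at optimum:    {training_flops(n, d_opt):.2e} FLOPs")
print(f"compute at 15T tokens: {training_flops(n, 15e12):.2e} FLOPs")
```

Given a fixed FLOP budget, the same relations can be inverted to pick the model size instead, which is the "and vice versa" above.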
However, what people have realized is that inference compute matters at least as much as training compute. You want to optimize training and inference cost together, not in isolation. Training a smaller model means your accuracy will not be as good as it could have been with a larger model using the same training budget; however, you'll more than make it up in your inference budget. So in most real-world cases it doesn't make sense to be "Chinchilla-optimal".
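The trade-off above can be sketched numerically. This is a hedged back-of-the-envelope comparison, not a real cost model: training is taken as ~6 FLOPs and inference as ~2 FLOPs per parameter per token, and the token counts are illustrative assumptions:

```python
def total_flops(n_params: float, train_tokens: float, inference_tokens: float) -> float:
    """Rough lifetime compute: training plus inference, in FLOPs."""
    train = 6.0 * n_params * train_tokens          # ~6N FLOPs per training token
    inference = 2.0 * n_params * inference_tokens  # ~2N FLOPs per generated token
    return train + inference

serving = 1e13  # assume 10T tokens served over the model's lifetime

# An "overtrained" 8B model (15T tokens) vs. a roughly
# Chinchilla-optimal 70B model (~20 tokens/param = 1.4T tokens).
small = total_flops(8e9, 15e12, serving)
large = total_flops(70e9, 1.4e12, serving)
print(small < large)  # at high enough inference volume, the small model wins
```

The crossover depends entirely on the assumed serving volume: at low inference volume the Chinchilla-optimal larger model is cheaper overall, which is why the right choice differs between companies.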
What Meta is saying here is that there is no accuracy ceiling. You can keep increasing training budget and dataset size to increase accuracy seemingly indefinitely (with diminishing returns). At least as far as they have explored.
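"Log-linear with diminishing returns" has a concrete shape: each 10x increase in data buys a roughly constant drop in loss, so each additional token buys less than the last. A toy illustration with made-up coefficients (not Meta's actual fit):

```python
import math

def loss(tokens: float, a: float = 5.0, b: float = 0.3) -> float:
    """Illustrative log-linear scaling curve: loss falls by b per 10x tokens."""
    return a - b * math.log10(tokens)

for t in [2e11, 2e12, 2e13]:
    print(f"{t:.0e} tokens -> loss {loss(t):.2f}")
```

Each line shows the same absolute improvement for a 10x larger (and 10x more expensive) dataset, which is "seemingly indefinite" improvement at ever-increasing cost per increment.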
HarHarVeryFunny|1 year ago
Meta has a massive user base, and if they are using these models to run their own business, that implies massive inference volume. It might therefore make economic sense for them to put more money into training (to make smaller/cheaper models more powerful) than it would for companies with lower inference volume.
To put it another way: if their internal use of these models is very high, it wouldn't be surprising to see Meta continue to release models that beat the competition size for size, since they are incentivized to pump more tokens through them during training.