It could be even smaller than a Chinchilla-optimal model. The Chinchilla paper was about training the most capable model for a given amount of training compute. If you're optimizing for capability per unit of inference compute instead, you can "over-train" by feeding in far more data per parameter than even Chinchilla prescribes, or you can train a larger model and then distill it down to a smaller size. Increasing context size increases inference compute, but the extra capability from a long context might let you skimp on parameters and give a net decrease in compute. There are probably other strategies as well, but those are the ones I know of.
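Back-of-envelope version of the tradeoff, using the usual rough approximations (training ≈ 6·N·D FLOPs, inference ≈ 2·N FLOPs per generated token); the model sizes and token counts below are made-up illustrations, not real model specs:

```python
# Rough FLOP accounting for training vs. inference compute.
# Approximations: training ~ 6*N*D FLOPs, inference ~ 2*N FLOPs/token.

def train_flops(n_params, n_train_tokens):
    return 6 * n_params * n_train_tokens

def inference_flops(n_params, n_served_tokens):
    return 2 * n_params * n_served_tokens

served = 1e12  # hypothetical lifetime tokens served at inference

# Chinchilla-optimal: ~20 training tokens per parameter.
chin_n, chin_d = 70e9, 20 * 70e9          # 70B params, 1.4T tokens
# "Over-trained" smaller model: many more tokens per parameter.
over_n, over_d = 13e9, 150 * 13e9         # 13B params, ~2T tokens

chin_total = train_flops(chin_n, chin_d) + inference_flops(chin_n, served)
over_total = train_flops(over_n, over_d) + inference_flops(over_n, served)

print(f"chinchilla-optimal: ~{chin_total:.2e} total FLOPs")
print(f"over-trained small: ~{over_total:.2e} total FLOPs")
```

The over-trained model spends more training compute per parameter but, being smaller, is much cheaper per token at inference, so past some serving volume its lifetime total comes out lower.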
habitue|3 years ago