It could be even smaller than a Chinchilla-optimal model. The Chinchilla paper was about training the most capable model for a given amount of training compute. If you're optimizing for capability per unit of inference compute instead, you can "over-train" by feeding in far more data per parameter than even Chinchilla prescribes, or you can train a larger model and then distill it down to a smaller size. Increasing context size increases inference compute, but the extra capability from a long context might let you skimp on parameters and give a net decrease in compute. There are probably other strategies as well, but those are the ones I know of.
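Back-of-envelope version of the tradeoff, using the usual rough approximations (training ≈ 6·N·D FLOPs, inference ≈ 2·N FLOPs per generated token); the model sizes and token counts below are made-up illustrations, not real model specs:

```python
# Rough FLOP accounting for training vs. inference compute.
# Approximations: training ~ 6*N*D FLOPs, inference ~ 2*N FLOPs/token.

def train_flops(n_params, n_train_tokens):
    return 6 * n_params * n_train_tokens

def inference_flops(n_params, n_served_tokens):
    return 2 * n_params * n_served_tokens

served = 1e12  # hypothetical lifetime tokens served at inference

# Chinchilla-optimal: ~20 training tokens per parameter.
chin_n, chin_d = 70e9, 20 * 70e9          # 70B params, 1.4T tokens
# "Over-trained" smaller model: many more tokens per parameter.
over_n, over_d = 13e9, 150 * 13e9         # 13B params, ~2T tokens

chin_total = train_flops(chin_n, chin_d) + inference_flops(chin_n, served)
over_total = train_flops(over_n, over_d) + inference_flops(over_n, served)

print(f"chinchilla-optimal: ~{chin_total:.2e} total FLOPs")
print(f"over-trained small: ~{over_total:.2e} total FLOPs")
```

The over-trained model spends more training compute per parameter but, being smaller, is much cheaper per token at inference, so past some serving volume its lifetime total comes out lower.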
habitue|3 years ago