top | item 35982315

crosen99 | 2 years ago

I'm surprised not to see anything about data-to-parameter ratios for optimal scaling. My superficial understanding per the Chinchilla paper is to target 20 to 1.

I'm also confused about this:

> ~$1 million: Cost to train a 13 billion parameter model on 1.4 trillion tokens

This is apparently related to the LLaMA paper, but that paper seems to cite 1.0T tokens (rather than 1.4T) for the 13B model. Also, if 20 to 1 really is the optimal data-to-parameter ratio, then a figure derived from a roughly 100 to 1 ratio doesn't seem like an appropriate way to arrive at a magic number for training costs. The magic number should really be based on an optimal configuration. Or perhaps my superficial understanding is causing me to miss some important distinctions.
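The arithmetic behind this objection can be checked directly. A minimal sketch, assuming the Chinchilla rule of thumb of roughly 20 training tokens per parameter and the figures quoted above:

```python
# Sketch of the data-to-parameter ratio arithmetic.
# Assumes the Chinchilla rule of thumb of ~20 training tokens per parameter.
params = 13e9    # 13B-parameter model
tokens = 1.4e12  # 1.4T tokens, the figure quoted in the article

ratio = tokens / params
print(f"actual data-to-parameter ratio: {ratio:.0f} to 1")  # ~108 to 1

chinchilla_optimal_tokens = 20 * params
print(f"Chinchilla-optimal tokens for 13B: {chinchilla_optimal_tokens / 1e9:.0f}B")  # 260B
```

So 1.4T tokens on a 13B model is roughly a 108-to-1 ratio, about five times the Chinchilla-optimal token count of 260B, which is the gap the comment is pointing at.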

llambada | 2 years ago

The Chinchilla paper only addresses the contrived case of a model that is trained once and never used for inference. Since most real-world compute cost goes to inference, Chinchilla offers little practical guidance there.
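This trade-off can be made concrete with the common FLOP approximations (training ≈ 6·N·D, inference ≈ 2·N per token). A rough sketch, where the lifetime serving volume of 1e12 tokens is an assumed figure for illustration and model quality differences are ignored:

```python
# Rough training-vs-inference compute comparison, using the common
# approximations: training ~ 6*N*D FLOPs, inference ~ 2*N FLOPs per token.
# The lifetime inference volume below is an assumed figure for illustration.

def train_flops(n_params, n_tokens):
    return 6 * n_params * n_tokens

def infer_flops(n_params, n_tokens):
    return 2 * n_params * n_tokens

inference_tokens = 1e12  # assumed lifetime serving volume

# Model A: Chinchilla-optimal 13B model (20 tokens per parameter).
a_params, a_tokens = 13e9, 20 * 13e9
# Model B: smaller 7B model over-trained to the SAME training compute budget.
b_params = 7e9
b_tokens = train_flops(a_params, a_tokens) / (6 * b_params)

total_a = train_flops(a_params, a_tokens) + infer_flops(a_params, inference_tokens)
total_b = train_flops(b_params, b_tokens) + infer_flops(b_params, inference_tokens)
print(f"13B Chinchilla-optimal lifetime total: {total_a:.2e} FLOPs")
print(f"7B over-trained lifetime total:        {total_b:.2e} FLOPs")
# At equal training compute, the smaller over-trained model ends up cheaper
# over its lifetime, because every served token costs fewer FLOPs.
```

This is the sense in which a beyond-Chinchilla token count (as in LLaMA) can still be rational once inference is counted, even though it is suboptimal by the training-only criterion.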