crosen99 | 2 years ago
I'm also confused about this:
> ~$1 million: Cost to train a 13 billion parameter model on 1.4 trillion tokens
This is apparently drawn from the LLaMA paper, but that paper reports 1.0T tokens (not 1.4T) for the 13B model; the 1.4T figure applies to the 33B and 65B models. Also, if 20:1 is in fact the optimal data-to-parameter ratio, then 1.4T tokens for a 13B model is roughly 108:1, which doesn't seem like an appropriate way to arrive at a magic number for training costs. The magic number should really be based on an optimal configuration, i.e. roughly 20 × 13B ≈ 260B tokens. Or perhaps my superficial understanding here leads me to miss some important distinctions.
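For what it's worth, here's a back-of-envelope sketch of where a ~$1M figure could come from, using the standard C ≈ 6ND compute approximation. The GPU price and utilization numbers below are my own assumptions (not anything from the article), so treat the output as order-of-magnitude only:

    # Rough check of the numbers above. The 20:1 Chinchilla ratio and the
    # 6*N*D FLOPs rule are standard; the $/GPU-hour and utilization are
    # assumptions I picked, not figures from the article.
    params = 13e9      # 13B-parameter model
    tokens = 1.4e12    # 1.4T training tokens (the article's figure)

    # Data-to-parameter ratio: ~108:1, far above the ~20:1 Chinchilla optimum
    ratio = tokens / params
    chinchilla_tokens = 20 * params   # ~260B tokens would be "optimal"

    # Training compute via C ~= 6 * N * D  (~1.1e23 FLOPs here)
    flops = 6 * params * tokens

    # Hypothetical cost model: A100s (312 TFLOPS bf16 peak) at an
    # assumed 40% utilization and an assumed $3 per GPU-hour
    effective_flops_per_sec = 312e12 * 0.40
    gpu_hours = flops / effective_flops_per_sec / 3600
    cost = gpu_hours * 3.00

    print(f"data:param ratio           ~{ratio:.0f}:1")
    print(f"Chinchilla-optimal tokens  ~{chinchilla_tokens:.2e}")
    print(f"training FLOPs             ~{flops:.2e}")
    print(f"GPU-hours                  ~{gpu_hours:,.0f}")
    print(f"rough cost                 ~${cost:,.0f}")

On these assumptions the total lands in the high six figures, so ~$1M is at least the right order of magnitude, even if the token count and the optimality question both stand.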
llambada | 2 years ago