pellucide | 1 year ago

Somewhere I read that the 8B llama2 model could be undertrained by 100-1000x. So is it possible to train a model with 8B/100 = 80M parameters to perform as well as the llama2 8B model, given enough training time and training tokens?

modeless | 1 year ago

It's unclear. It might take a larger dataset than actually exists, or more compute than is practical. Or there may be a hard limit on how far a model that small can get, one we just haven't reached yet; that actually seems quite likely. The scaling "laws" are really more like guidelines, and they are likely wrong when extrapolated that far.
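
For a rough sense of why a floor seems plausible, here's a back-of-the-envelope sketch (not from the thread) using the parametric loss fit from the Chinchilla paper (Hoffmann et al., 2022), L(N, D) = E + A/N^alpha + B/D^beta, with the paper's published constants. The ~2T token figure is just Llama 2's reported pretraining size; take the printed numbers with the same grain of salt as the fit itself, since this is exactly the kind of extrapolation that may not hold:

    # Sketch of the Chinchilla parametric loss fit (Hoffmann et al., 2022):
    #     L(N, D) = E + A / N**alpha + B / D**beta
    # Constants below are the paper's published fit; the outputs are only as
    # trustworthy as that fit is when extrapolated this far.

    E, A, B = 1.69, 406.4, 410.7   # fitted irreducible loss and scale coefficients
    alpha, beta = 0.34, 0.28       # fitted exponents for parameters (N) and tokens (D)

    def predicted_loss(n_params: float, n_tokens: float) -> float:
        """Chinchilla-fit pretraining loss for n_params parameters and n_tokens tokens."""
        return E + A / n_params ** alpha + B / n_tokens ** beta

    small, large = 80e6, 8e9       # 80M vs 8B parameters
    llama2_tokens = 2e12           # Llama 2's reported ~2T pretraining tokens

    # With unlimited data the D term vanishes, so each model size keeps a floor
    # of E + A / N**alpha; more tokens alone can't close an arbitrary size gap.
    print(f"80M floor as tokens -> inf: {E + A / small ** alpha:.2f}")             # ~2.53
    print(f"8B  floor as tokens -> inf: {E + A / large ** alpha:.2f}")              # ~1.86
    print(f"8B  at ~2T tokens:          {predicted_loss(large, llama2_tokens):.2f}")  # ~2.01

Under that fit, the 80M model's floor (~2.5) never reaches the loss the 8B model already hits at ~2T tokens (~2.0), so "enough tokens" alone wouldn't do it. But again, nobody has validated the fit anywhere near that regime.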