pico_creator | 11 months ago
I view it more as a shortcut. We have trained 7B and 14B models from scratch, matching transformer performance with similarly sized datasets.
This has even been shown to slightly outperform the transformer scaling law in the training runs we've done from 1B to 14B, and we expect it to keep doing so as we scale.
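(For concreteness, here's a minimal sketch of how that kind of comparison is typically made; it is not from this thread and every loss number in it is invented. The usual approach is to fit a Chinchilla-style power law L(N) = a·N^(-b) + c to the transformer baselines, then check whether the other architecture lands below the fitted curve at each size.)

```python
# Illustrative sketch only: the loss values below are hypothetical, not RWKV results.
import numpy as np
from scipy.optimize import curve_fit

def power_law(n_billion, a, b, c):
    # Chinchilla-style form: eval loss falls as a power of parameter count.
    return a * n_billion ** (-b) + c

# Hypothetical transformer baseline: (params in billions, eval loss) at matched data budgets.
n = np.array([1.0, 3.0, 7.0, 14.0])
transformer_loss = np.array([2.95, 2.72, 2.58, 2.47])

(a, b, c), _ = curve_fit(power_law, n, transformer_loss, p0=[1.5, 0.3, 1.5], maxfev=10000)

# Hypothetical losses for the candidate architecture at the same sizes.
candidate_loss = np.array([2.93, 2.69, 2.55, 2.43])

for size, loss in zip(n, candidate_loss):
    predicted = power_law(size, a, b, c)
    print(f"{size:>4.0f}B: baseline fit {predicted:.3f}, observed {loss:.3f}, "
          f"delta {loss - predicted:+.3f}")
```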
However, as of this point, answering and settling that debate for good at 72B scale is a $5 million bill. So for now, we use the shortcuts to just show that it actually works, and put that money toward iterating on and improving the architecture faster.
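(As a rough back-of-envelope, and not a figure from this thread: the usual way such a budget is estimated is ~6·N·D training FLOPs divided by sustained GPU throughput, times an hourly price. All rates and prices below are assumptions; a real budget also covers ablations, restarts, and evals, which pushes the total up by a few multiples.)

```python
# Back-of-envelope only; every constant here is an assumption, not a quoted figure.
params = 72e9                      # model size
tokens = 20 * params               # Chinchilla-style ~20 tokens per parameter
train_flops = 6 * params * tokens  # ~6*N*D FLOPs for one training pass

gpu_flops = 400e12                 # assumed sustained throughput per GPU (FLOP/s)
gpu_hour_cost = 4.0                # assumed $/GPU-hour

gpu_hours = train_flops / gpu_flops / 3600
print(f"~{train_flops:.2e} FLOPs, ~{gpu_hours:,.0f} GPU-hours, "
      f"~${gpu_hours * gpu_hour_cost / 1e6:.1f}M for a single run at these rates")
```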
inhumantsar | 11 months ago