pico_creator | 11 months ago
I view it more as a shortcut. We have trained 7B and 14B models from scratch, matching transformer performance with similarly sized datasets.
This has even been shown to slightly outperform the transformer scaling law in the training runs we've done from 1B to 14B, and we expect it to keep doing so as we scale.
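(For concreteness, here's a minimal sketch of how that kind of comparison is typically made; it is not from this thread and every loss number in it is invented. The usual approach is to fit a Chinchilla-style power law L(N) = a·N^(-b) + c to the transformer baselines, then check whether the other architecture lands below the fitted curve at each size.)

```python
# Illustrative sketch only: the loss values below are hypothetical, not RWKV results.
import numpy as np
from scipy.optimize import curve_fit

def power_law(n_billion, a, b, c):
    # Chinchilla-style form: eval loss falls as a power of parameter count.
    return a * n_billion ** (-b) + c

# Hypothetical transformer baseline: (params in billions, eval loss) at matched data budgets.
n = np.array([1.0, 3.0, 7.0, 14.0])
transformer_loss = np.array([2.95, 2.72, 2.58, 2.47])

(a, b, c), _ = curve_fit(power_law, n, transformer_loss, p0=[1.5, 0.3, 1.5], maxfev=10000)

# Hypothetical losses for the candidate architecture at the same sizes.
candidate_loss = np.array([2.93, 2.69, 2.55, 2.43])

for size, loss in zip(n, candidate_loss):
    predicted = power_law(size, a, b, c)
    print(f"{size:>4.0f}B: baseline fit {predicted:.3f}, observed {loss:.3f}, "
          f"delta {loss - predicted:+.3f}")
```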
However, as of this point, answering and settling that debate for good at 72B scale is a $5 million bill. So for now, we use the shortcuts to just show that it actually works, and put that money toward iterating on and improving the architecture faster.
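(As a rough back-of-envelope, and not a figure from this thread: the usual way such a budget is estimated is ~6·N·D training FLOPs divided by sustained GPU throughput, times an hourly price. All rates and prices below are assumptions; a real budget also covers ablations, restarts, and evals, which pushes the total up by a few multiples.)

```python
# Back-of-envelope only; every constant here is an assumption, not a quoted figure.
params = 72e9                      # model size
tokens = 20 * params               # Chinchilla-style ~20 tokens per parameter
train_flops = 6 * params * tokens  # ~6*N*D FLOPs for one training pass

gpu_flops = 400e12                 # assumed sustained throughput per GPU (FLOP/s)
gpu_hour_cost = 4.0                # assumed $/GPU-hour

gpu_hours = train_flops / gpu_flops / 3600
print(f"~{train_flops:.2e} FLOPs, ~{gpu_hours:,.0f} GPU-hours, "
      f"~${gpu_hours * gpu_hour_cost / 1e6:.1f}M for a single run at these rates")
```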
inhumantsar | 11 months ago