nodja | 2 months ago
The same way everyone else fails at it.
Adjust some hyperparameters to match the new hardware (more params), maybe fold in the latest improvements from papers after validating them in a smaller model run. Start training the big boy; loss looks good. Two months and millions of dollars later the loss plateaus, so you do the whole SFT/RL shebang and run benchmarks.
It's not much better than the previous model: very tiny improvements. Oops.
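The "change some hyperparameters to match the new hardware" step the comment describes can be sketched with a common rule of thumb. This is a hypothetical illustration, not any lab's actual recipe: it assumes the approximation C ≈ 6·N·D (training FLOPs ≈ 6 × params × tokens) and the Chinchilla-style ratio of roughly 20 training tokens per parameter.

```python
def scale_config(compute_flops: float) -> tuple[float, float]:
    """Split a training-compute budget between parameters and tokens.

    Assumes C ~= 6 * N * D (FLOPs per token ~ 6N) and the
    Chinchilla-style heuristic D ~= 20 * N. Both are rough
    approximations used only for illustration.
    """
    # Substituting D = 20 * N into C = 6 * N * D gives C = 120 * N^2.
    n_params = (compute_flops / 120) ** 0.5
    n_tokens = 20 * n_params
    return n_params, n_tokens

# Example: allocate a 1e24-FLOP budget.
params, tokens = scale_config(1e24)
print(f"params ~ {params:.2e}, tokens ~ {tokens:.2e}")
```

More hardware means a bigger budget C, which under these assumptions moves both N and D up together, which is exactly the "more params" tuning the comment is gesturing at.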
wrsh07 | 2 months ago
Many people thought it was an improvement, though.