top | item 44899270

(no title)

I suspect one can go a lot further by adopting some tweaks from the GPT-2 speedrun effort [0], at minimum Muon, better init and carefully tuning learning rate.

[0]: https://github.com/KellerJordan/modded-nanogpt

discuss

No comments yet.