tootyskooty | 6 months ago

I suspect one can go a lot further by adopting some tweaks from the GPT-2 speedrun effort [0]: at minimum Muon, better init, and a carefully tuned learning rate.

[0]: https://github.com/KellerJordan/modded-nanogpt
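For readers who have not met Muon: as far as I understand it, the core idea is to take the momentum of a 2D weight matrix's gradient and approximately orthogonalize it with a few Newton-Schulz iterations before applying it as the update. Below is a rough NumPy sketch of that idea, not the code from [0]. The function names, the `lr` and `beta` defaults, and the plain (non-Nesterov) momentum are my own simplifications, and the quintic coefficients are the ones I recall being quoted for Muon, so treat them as an assumption and check them against the repository.

```python
import numpy as np


def newton_schulz_orthogonalize(G, steps=5, eps=1e-7):
    """Approximately map G to the nearest semi-orthogonal matrix.

    Coefficients are assumed (recalled from public Muon write-ups);
    verify them against the linked repository before relying on them.
    """
    a, b, c = 3.4445, -4.7750, 2.0315
    # Scale so all singular values are at most 1 (Frobenius norm bound).
    X = G / (np.linalg.norm(G) + eps)
    # Work on the wide orientation so X @ X.T is the smaller Gram matrix.
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        B = b * A + c * (A @ A)
        # Applies the quintic a*s + b*s^3 + c*s^5 to each singular value s.
        X = a * X + B @ X
    return X.T if transposed else X


def muon_like_step(W, grad, momentum_buf, lr=0.02, beta=0.95):
    """One simplified Muon-style update for a 2D weight matrix (hypothetical helper)."""
    momentum_buf = beta * momentum_buf + grad
    update = newton_schulz_orthogonalize(momentum_buf)
    return W - lr * update, momentum_buf
```

The iteration does not converge to exact orthogonality; it only pushes the singular values of the update into a band around 1, which is reportedly good enough in practice. Embeddings, biases, and other non-matrix parameters would still need a conventional optimizer such as AdamW.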