spi | 2 months ago

Yes exactly, I fear that shortening the training time would skew the results. In the very short term, a smaller batch size is typically better, simply because you need a certain number of gradient updates to move away from the original random, hence pretty terrible, weight distribution. A larger batch size gives steadier but slower convergence, so it's hard to say for sure which is better for a given compute budget.
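To make the trade-off concrete, here is a rough sketch of how many gradient updates a fixed token budget buys at different batch sizes. The budget, sequence length, and batch sizes are made-up numbers for illustration, not figures from the training runs discussed here.

```python
def optimizer_steps(token_budget: int, batch_size: int, seq_len: int) -> int:
    """Number of full gradient updates that a fixed token budget pays for."""
    tokens_per_step = batch_size * seq_len
    return token_budget // tokens_per_step


# Hypothetical values, chosen only to illustrate the arithmetic.
TOKEN_BUDGET = 1_000_000_000
SEQ_LEN = 1024

for batch_size in (16, 64, 256, 1024):
    steps = optimizer_steps(TOKEN_BUDGET, batch_size, SEQ_LEN)
    print(f"batch_size={batch_size:5d} -> {steps:6d} updates")
```

With the budget held constant, every 4x increase in batch size cuts the number of updates by 4x, which is why a short run can favor small batches even if a long run would not.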

I'm definitely _not_ encouraging you to spend more money on a side topic just for the sake of optimizing this one parameter; there will always be another parameter after that one that you'll feel an urge to optimize :-) I'd say it's already a pretty neat result to have come very close to the score of the original GPT-2 training, starting from scratch!

P.S. If you want to push it a bit further, rather than optimizing parameters for this model, last week at EurIPS I heard that a current "very good" modern repo to start from in order to train a good LLM is this: https://github.com/Niccolo-Ajroldi/plainLM. I haven't investigated this exactly (I'm not working on LLMs), but it might be interesting to you for a sample run. The (N)EurIPS paper that was discussed at the conference claimed that the only important change to make was to the hyperparameters of the Adam optimizer, setting beta1=beta2=0.95 for example (the default values are beta1=0.9 and beta2=0.999, which are apparently outdated).
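For intuition on what changing the betas does (my own back-of-the-envelope sketch, not something taken from the paper or the repo): each beta is the decay rate of an exponential moving average, so it averages over roughly 1 / (1 - beta) recent steps.

```python
def ema_horizon(beta: float) -> float:
    """Approximate number of recent steps an exponential moving average
    with decay `beta` effectively averages over: 1 / (1 - beta)."""
    if not 0.0 <= beta < 1.0:
        raise ValueError("beta must be in [0, 1)")
    return 1.0 / (1.0 - beta)


# Default Adam betas versus the values mentioned above.
settings = (
    ("default beta1", 0.9),
    ("default beta2", 0.999),
    ("suggested beta1 = beta2", 0.95),
)
for name, beta in settings:
    print(f"{name:24s} beta={beta:<6} ~{ema_horizon(beta):.0f} steps")
```

So the default beta2 smooths the squared-gradient estimate over about a thousand steps, while 0.95 reacts within about twenty. In PyTorch these are set through the `betas` argument of `torch.optim.Adam` / `torch.optim.AdamW`, e.g. `betas=(0.95, 0.95)`.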

gpjt | 2 months ago

Awesome, thanks! I'm still doing trains on the big machines right now (hopefully will write up over xmas) but I think once I've worked out the sweet spot for megatokens per dollar for this model, it's time to start tweaking the other controls -- LR and cosine variation of it, as you said, and also dropout, bias, weight tying, and definitely gradient clipping (which should at least get better bang for the buck from time/$ spent). I'll leave it to Google to follow up Chinchilla with a "best batch size across a thousand trained models" paper ;-)
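For reference, a minimal sketch of two of those controls in plain Python: a cosine learning-rate schedule with linear warmup, and gradient clipping by global norm. The function names and default values are placeholders of mine, not settings from the actual runs.

```python
import math


def cosine_lr(step, total_steps, max_lr, min_lr=0.0, warmup_steps=0):
    """Linear warmup to max_lr, then cosine decay down to min_lr."""
    if warmup_steps > 0 and step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    decay_steps = max(1, total_steps - warmup_steps)
    progress = min(1.0, (step - warmup_steps) / decay_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


def clip_by_global_norm(grads, max_norm):
    """Rescale a flat list of gradient values so their L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm or norm == 0.0:
        return list(grads)
    scale = max_norm / norm
    return [g * scale for g in grads]


# Hypothetical schedule: 1000 steps, 100 of them warmup.
for step in (0, 99, 100, 550, 1000):
    print(step, cosine_lr(step, total_steps=1000, max_lr=6e-4, min_lr=6e-5, warmup_steps=100))
```

In PyTorch the clipping half is what `torch.nn.utils.clip_grad_norm_` does on real parameter gradients; the version above only shows the arithmetic.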