gpjt | 2 months ago
As part of the upcoming post I'm running the DDP train on A100s with 40 GiB and 80 GiB, H100s with 80 GiB, and B200s with 160 GiB, so I'll have at least three loss vs. batch size points to plot. That might be interesting.
I guess a full test would be to train at various batch sizes on the 160 GiB machine and plot the resulting loss. That would be very expensive as a hobby project (the bs=64 train cost a bit more than $40 excluding overhead), so I won't do it.
But perhaps a shorter train would still be of value? That is, train for 300M tokens at a tenth of the cost and see where the loss landed? The problem with that would be if the impact of batch size varied with the length of the train, e.g. if batch size 64 was better than 512 for short trains but weaker for longer ones.
spi | 2 months ago
I'm definitely _not_ encouraging you to spend more money on a side topic just for the sake of optimizing this one parameter; there will always be another parameter after that one that you'll feel an urge to optimize :-) I'd say it's already a pretty neat result to have come so close to the original GPT-2 score training from scratch!
P.S. If you want to push it a bit further, rather than optimizing parameters for this model: last week at EurIPS I heard that a current "very good" modern repo to start from in order to train a good LLM is this: https://github.com/Niccolo-Ajroldi/plainLM. I haven't looked into it closely (I don't work on LLMs), but it might be interesting to you for a sample run. The (N)EurIPS paper discussed at the conference claimed that the only important change was to modify the Adam optimizer's hyperparameters, e.g. setting beta1 = beta2 = 0.95 (the defaults of beta1 = 0.9 and beta2 = 0.999 are apparently outdated).
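For anyone curious what that change looks like in practice, here's a minimal sketch in PyTorch (not taken from plainLM or the paper; the model, learning rate, and weight decay below are placeholder values, and only the betas reflect the suggestion above):

```python
import torch

# Placeholder model purely for illustration.
model = torch.nn.Linear(768, 768)

# AdamW with the betas suggested above; PyTorch's defaults are (0.9, 0.999).
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=3e-4,              # example learning rate, not from the thread
    betas=(0.95, 0.95),   # beta1 = beta2 = 0.95 as mentioned at (N)EurIPS
    weight_decay=0.1,     # hypothetical value for illustration
)
```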
gpjt | 2 months ago