gpjt | 2 months ago
As part of the upcoming post I'm running the DDP train on A100s with 40 GiB and 80 GiB, H100s with 80 GiB, and B200s with 160 GiB, so I'll have at least three loss vs. batch size points to plot. That might be interesting.
I guess a full test would be to train at various batch sizes on the 160 GiB machine and plot the resulting loss. That would be very expensive as a hobby project (the bs=64 train cost a bit more than $40 excluding overhead), so I won't do it.
But perhaps a shorter train would still be of value? That is, train for 300M tokens at a tenth of the cost and see where the loss landed? The problem with that would be if the impact of batch size varied with the length of the train, e.g. if batch size 64 was better than 512 for short trains but weaker for longer ones.
spi | 2 months ago
I'm definitely _not_ encouraging you to spend more money on a side topic just for the sake of optimizing this one parameter; there will always be another parameter after that one that you'll feel an urge to optimize :-) I'd say it's already a pretty neat result to have come so close to the original GPT-2 score training from scratch!
P.S. If you want to push it a bit further, rather than optimizing parameters for this model: last week at EurIPS I heard that a current "very good" modern repo to start from in order to train a good LLM is this: https://github.com/Niccolo-Ajroldi/plainLM. I haven't looked into it closely (I don't work on LLMs), but it might be interesting to you for a sample run. The (N)EurIPS paper discussed at the conference claimed that the only important change was to modify the Adam optimizer's hyperparameters, e.g. setting beta1 = beta2 = 0.95 (the defaults of beta1 = 0.9 and beta2 = 0.999 are apparently outdated).
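For anyone curious what that change looks like in practice, here's a minimal sketch in PyTorch (not taken from plainLM or the paper; the model, learning rate, and weight decay below are placeholder values, and only the betas reflect the suggestion above):

```python
import torch

# Placeholder model purely for illustration.
model = torch.nn.Linear(768, 768)

# AdamW with the betas suggested above; PyTorch's defaults are (0.9, 0.999).
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=3e-4,              # example learning rate, not from the thread
    betas=(0.95, 0.95),   # beta1 = beta2 = 0.95 as mentioned at (N)EurIPS
    weight_decay=0.1,     # hypothetical value for illustration
)
```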
gpjt | 2 months ago