blt|9 months ago
I can see how such a phenomenon could happen at the level of a single machine, but if we're using a whole data center full of GPU machines, it should be possible to spread those spikes out evenly over time. It's still odd that the article implies spikiness is a fundamental property of AI workloads rather than a design choice that could be fixed at the software level.
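To make the intuition concrete, here's a toy simulation (all numbers made up: 64 machines, a spike once every 100 steps, hypothetical per-machine draws of 10 vs. 1 units). If every machine hits its power spike at the same step, the facility peak is 64x the spike; if each machine's phase is randomly offset, the peaks spread out and the aggregate flattens dramatically:

```python
import numpy as np

rng = np.random.default_rng(0)
n_machines, period, steps = 64, 100, 10_000
spike, base = 10.0, 1.0  # hypothetical per-machine draw (arbitrary units)

def aggregate_power(offsets):
    """Total facility draw per step, given each machine's phase offset."""
    t = np.arange(steps)
    # a machine draws `spike` on one step per period, `base` otherwise
    draw = np.where((t[None, :] + offsets[:, None]) % period == 0, spike, base)
    return draw.sum(axis=0)

sync = aggregate_power(np.zeros(n_machines, dtype=int))           # barrier: all in phase
staggered = aggregate_power(rng.integers(0, period, n_machines))  # random phase offsets

print("peak draw, synchronized:", sync.max())
print("peak draw, staggered:   ", staggered.max())
```

The catch, of course, is that with a synchronous barrier you can't actually stagger the phases without idling machines, which is the point jfim makes below.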
jfim|9 months ago
The only way to spread the spikes would be to make the training run slower, but that'd be a hard sell considering training runs can already take days.
blt|9 months ago
Basically, there are two properties of ML optimization that save us: 1) the objective is a huge summation over many small objectives (the losses for each training data point), and 2) the same noise-robustness that makes SGD work in the first place can give us robustness against further noise caused by out-of-order updates.
So I think this issue can be overcome fairly easily. Does anyone know if the big LLM-training companies use asynchronous updates like [1,2]? Or do they still use a big barrier?
[1] https://proceedings.neurips.cc/paper_files/paper/2011/hash/2...
[2] https://arxiv.org/abs/2401.09135