item 43911090

blt | 9 months ago

I can see how such a phenomenon could happen at the level of a single machine, but if we're using a whole data center full of GPU machines, it should be possible to spread those spikes out evenly over time. It's still odd that the article implies spikiness is a fundamental property of AI workloads rather than a design oversight that could be fixed at the software level.

jfim | 9 months ago

When running data-parallel training, all the nodes taking part run essentially the same training loop in lockstep: each node runs the forward and backward passes on its GPU, then waits for the gradients to be all-reduced across nodes, after which the weights are updated and the next iteration can begin. During the compute phase the GPU is busy; while waiting on the network it's idle. The spikes are therefore synchronous across all nodes doing the training.

The only way to spread the spikes would be to make the training run slower, but that'd be a hard sell considering training can sometimes be measured in days.
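To make the structure of that loop concrete, here's a minimal toy sketch of synchronous data-parallel SGD on a scalar model. The names (`local_gradient`, `all_reduce_mean`, `sync_step`) and the two-node setup are illustrative, not any real framework's API; in practice the all-reduce would be a collective over the network, which is exactly where every GPU sits idle at the same moment.

```python
# Toy sketch of one synchronous data-parallel SGD step (illustrative names,
# not a real framework API). Objective: mean squared error to the data.

def local_gradient(weight, batch):
    # Forward + backward pass on one node's data shard (GPU busy here).
    # d/dw of mean((w - x)^2) over the batch.
    return sum(2 * (weight - x) for x in batch) / len(batch)

def all_reduce_mean(grads):
    # Network phase: every node blocks until all gradients arrive (GPU idle).
    return sum(grads) / len(grads)

def sync_step(weight, shards, lr=0.1):
    grads = [local_gradient(weight, s) for s in shards]  # compute spike
    g = all_reduce_mean(grads)                           # synchronization barrier
    return weight - lr * g                               # identical update on every node

w = 0.0
shards = [[1.0, 2.0], [3.0, 4.0]]  # per-node data shards
for _ in range(50):
    w = sync_step(w, shards)
print(round(w, 2))  # → 2.5, the global data mean
```

Because every node applies the same averaged gradient, the result is bit-identical to single-machine SGD on the full batch, which is precisely the equivalence the barrier buys you.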

blt | 9 months ago

I agree with part of your response: if we require that distributed training produce exactly the same sequence of weight updates as serial SGD on a single machine, then we do need a barrier like that. However, there is a lot of research on distributed optimization that relaxes the "exactly equivalent to serial SGD" requirement, including classic papers [1] and more recent ones [2].

Basically, there are two properties of ML optimization that save us: 1) the objective is a huge summation over many small objectives (the losses for each training data point), and 2) the same noise-robustness that makes SGD work in the first place can give us robustness against further noise caused by out-of-order updates.
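Property 2 can be illustrated with a rough Hogwild!-style [1] sketch: workers read and write a shared parameter with no barrier at all, so updates can be stale and interleave out of order, and SGD's noise-robustness is what keeps it converging anyway. The scalar model and names here are mine, purely for illustration.

```python
# Rough sketch of asynchronous (Hogwild!-style) SGD on a toy scalar model.
# Threads update the shared weight with no barrier and no locks, so reads
# may be stale and writes may interleave out of order.
import threading

weight = [0.0]                      # shared parameter, deliberately unlocked
shards = [[1.0, 2.0], [3.0, 4.0]]   # per-worker data shards
LR = 0.05

def worker(shard, steps=200):
    for _ in range(steps):
        w = weight[0]                                   # possibly stale read
        g = sum(2 * (w - x) for x in shard) / len(shard)
        weight[0] = w - LR * g                          # racy write; treated as extra noise

threads = [threading.Thread(target=worker, args=(s,)) for s in shards]
for t in threads:
    t.start()
for t in threads:
    t.join()

# The final value depends on how the threads interleaved; with balanced
# interleaving it hovers near the global data mean (2.5), and in all cases
# it stays within the range the shard means pull it toward.
print(weight[0])
```

There's no idle "everyone waits on the network" phase here: each worker updates as soon as its own gradient is ready, which is what spreads the spikes out — at the cost of giving up the exact-serial-SGD guarantee.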

So I think this issue can be overcome fairly easily. Does anyone know if the big LLM-training companies use asynchronous updates like [1,2]? Or do they still use a big barrier?

[1] https://proceedings.neurips.cc/paper_files/paper/2011/hash/2...

[2] https://arxiv.org/abs/2401.09135