blt | 9 months ago

I agree with this part of your response: If we were to require that distributed training generates the exact same sequence of weight updates as serial SGD on a single machine, then we would need a barrier like that. However, there is a lot of research on distributed optimization that addresses this issue by relaxing the "exactly equivalent to serial SGD" requirement, including classic papers [1] and more recent ones [2].

Basically, there are two properties of ML optimization that save us: 1) the objective is a huge summation over many small objectives (the losses for each training data point), and 2) the same noise-robustness that makes SGD work in the first place can give us robustness against further noise caused by out-of-order updates.

So I think this issue can be overcome fairly easily. Does anyone know if the big LLM-training companies use asynchronous updates like [1,2]? Or do they still use a big barrier?

[1] https://proceedings.neurips.cc/paper_files/paper/2011/hash/2...

[2] https://arxiv.org/abs/2401.09135
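For readers unfamiliar with [1], the core Hogwild!-style idea — workers apply SGD updates to shared parameters with no locks and no barrier — can be sketched in a few lines. This is an illustrative toy (a noiseless 1-D least-squares problem; the variable names, step size, and thread count are made up), not the paper's implementation, and Python threads only interleave rather than run truly in parallel:

```python
import random
import threading

# Toy sketch of lock-free asynchronous SGD in the spirit of Hogwild!.
# All names, the learning rate, and the 4-thread split are illustrative.
true_w = 3.0
data = [(x, true_w * x) for x in (random.uniform(-1, 1) for _ in range(1000))]

w = [0.0]   # shared parameter vector (one component here)
LR = 0.1

def worker(samples):
    for x, y in samples:
        grad = 2 * (w[0] * x - y) * x   # gradient of (w*x - y)^2
        w[0] -= LR * grad               # unsynchronized update: no lock, no barrier

threads = [threading.Thread(target=worker, args=(data[i::4],)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(abs(w[0] - true_w))  # small: the run converges despite interleaved updates
```

Even though workers can read a stale `w` and clobber each other's writes, each applied update still contracts the error toward the optimum, which is exactly the noise-robustness argument above.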

sdenton4 | 9 months ago

I think there's been some tendency against things like Hogwild! in favor of reproducibility and (its close cousin) debug-friendliness.

blt | 9 months ago

Understandable. However, I went down that rabbit hole once, and learned that even summing a 1D array on a GPU is nondeterministic: the reduction is broken into a tree (like merge sort), the scheduler can vary which partial sums combine first, and floating-point addition is not associative. I guess I assumed that practitioners had fully embraced randomness due to things like that.
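A minimal illustration of why reduction order matters, in pure Python with no GPU needed — floating-point addition is not associative, so different summation trees can produce different results for the same inputs:

```python
# The same three numbers, summed with two different groupings.
left = (0.1 + 0.2) + 0.3   # left-to-right, as a serial loop would
right = 0.1 + (0.2 + 0.3)  # a different reduction tree

print(left)           # 0.6000000000000001
print(right)          # 0.6
print(left == right)  # False
```

A GPU reduction over millions of elements hits this on a much larger scale, since the grouping depends on block/warp scheduling.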