blt | 9 months ago
Basically, two properties of ML optimization save us: 1) the objective is a huge summation over many small objectives (the losses for the individual training data points), and 2) the same noise-robustness that makes SGD work in the first place also gives us robustness against the extra noise introduced by out-of-order updates.
So I think this issue can be overcome fairly easily. Does anyone know if the big LLM-training companies use asynchronous updates like [1,2]? Or do they still use a big barrier?
[1] https://proceedings.neurips.cc/paper_files/paper/2011/hash/2...
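To make the idea concrete, here is a minimal sketch of Hogwild!-style lock-free asynchronous SGD on a toy least-squares problem (the problem setup, learning rate, and thread count are all illustrative choices, not anything from [1]). Each worker reads and writes the shared weight vector without any lock, so updates can interleave out of order, yet the run still converges because each step only nudges the parameters by one example's gradient:

```python
import threading
import numpy as np

# Toy objective: a big sum of per-example squared losses (property 1 above).
# Workers apply per-example gradient steps to a shared weight vector with
# no synchronization; the races just add gradient noise (property 2 above).

rng = np.random.default_rng(0)
true_w = np.array([2.0, -3.0])
X = rng.normal(size=(2000, 2))
y = X @ true_w                      # noiseless labels, for illustration

w = np.zeros(2)                     # shared parameters, updated lock-free

def worker(indices, lr=0.01):
    for i in indices:
        # Read a possibly stale w, compute one example's gradient, and
        # write back in place -- no lock, so updates may interleave.
        grad = 2.0 * (X[i] @ w - y[i]) * X[i]
        w[:] -= lr * grad

threads = [
    threading.Thread(target=worker, args=(rng.permutation(len(X)),))
    for _ in range(4)
]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(w)  # typically lands close to true_w despite the unsynchronized writes
```

Note that CPython's GIL serializes the individual bytecode operations here, so this sketch mostly illustrates the out-of-order interleaving rather than true parallel speedup; the real lock-free gains show up with native threads and sparse gradients, where concurrent updates rarely touch the same coordinates.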