
d3m0t3p | 1 year ago

Well, it's much like stochastic gradient descent, if you think about it. Plain gradient descent computes the gradient over the whole training set. Stochastic gradient descent uses a batch (a subset of the training set), and in the distributed case we compute two batches at once by taking the gradient on each in parallel. The intuition holds IMO, but indeed, applying the first batch's update and then the second is not the same as applying the mean update.

This is indeed super cool!
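
To make the last point concrete, here is a minimal NumPy sketch (toy linear-regression loss; the data, batch split, learning rate, and `grad` helper are all illustrative, not from the thread). Averaging the two parallel batch gradients reproduces the full-batch gradient, while applying the batches sequentially lands somewhere else:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))
y = rng.normal(size=8)
w0 = np.zeros(3)
lr = 0.1

def grad(w, Xb, yb):
    # gradient of the mean-squared-error loss 0.5 * mean((Xb @ w - yb)**2)
    return Xb.T @ (Xb @ w - yb) / len(yb)

# split the data into two "parallel" batches of equal size
X1, y1 = X[:4], y[:4]
X2, y2 = X[4:], y[4:]

# distributed / data-parallel step: average the two batch gradients;
# with equal batch sizes this equals the full-batch gradient exactly
g_avg = 0.5 * (grad(w0, X1, y1) + grad(w0, X2, y2))
w_parallel = w0 - lr * g_avg
assert np.allclose(g_avg, grad(w0, X, y))

# sequential SGD: apply batch 1's update, THEN evaluate batch 2's
# gradient at the already-updated weights
w_seq = w0 - lr * grad(w0, X1, y1)
w_seq = w_seq - lr * grad(w_seq, X2, y2)

# the two endpoints differ: two updates in a row != one mean update
print(np.allclose(w_parallel, w_seq))
```

The gap comes from batch 2's gradient being evaluated at a point that batch 1 already moved, rather than at the common starting point.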

eru | 1 year ago

Does anyone actually use the 'normal gradient descent' with the whole training set? I only ever see it as a sort of straw man to make explanation easier.

jey | 1 year ago

Generally yes, vanilla gradient descent gets plenty of use. But for LLMs: no, it’s not really used, and stochastic gradient descent provides a form of regularization, so it probably works better in addition to being more practical.

bravura|1 year ago

Full batch with L-BFGS, when possible, is wildly underappreciated.
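
For concreteness, a small sketch of full-batch L-BFGS on a toy least-squares problem, using SciPy's L-BFGS-B implementation (the data and noise level are made up for illustration; with `jac=True` the objective returns both the loss and its gradient, evaluated on the entire dataset at every step):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
w_true = rng.normal(size=5)
y = X @ w_true + 0.01 * rng.normal(size=100)

def loss_and_grad(w):
    # full-batch mean-squared-error loss and its exact gradient
    r = X @ w - y
    return 0.5 * np.mean(r**2), X.T @ r / len(y)

res = minimize(loss_and_grad, np.zeros(5), jac=True, method="L-BFGS-B")
print(res.success, res.x)
```

Because the gradient is exact (no minibatch noise), the quasi-Newton curvature estimates are reliable and convergence on smooth problems like this is very fast, which is the underappreciated part.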