Well, it's just like stochastic gradient descent, if you think about it. Normal gradient descent computes the gradient over the whole training set. The stochastic gradient is computed on a batch (a subset of the training set), and in the distributed case we compute two batches at once by taking the gradient of each in parallel.
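A minimal sketch of what that parallel step could look like, assuming a toy least-squares loss and two "workers" that each handle half of the batch (all names and data here are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))            # toy inputs
y = X @ np.array([1.0, -2.0, 0.5])     # toy linear targets

def batch_grad(w, Xb, yb):
    # Gradient of the mean squared error 0.5 * ||Xb @ w - yb||^2 / len(yb)
    return Xb.T @ (Xb @ w - yb) / len(yb)

w = np.zeros(3)
lr = 0.1

# Each "worker" computes the gradient of its own half-batch; in a real
# distributed setup these two calls would run on separate machines.
g1 = batch_grad(w, X[:4], y[:4])
g2 = batch_grad(w, X[4:], y[4:])

# One SGD step using the averaged gradient.
w -= lr * (g1 + g2) / 2.0
```

Because both halves are the same size and both gradients are taken at the same point, the averaged gradient here equals the gradient over the combined batch.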
The intuition works IMO, but indeed, applying the first batch's update and then the second's is not equal to applying the mean update. This is super cool!
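A tiny numerical example of that non-equivalence, using a made-up one-parameter quadratic loss per batch (nothing here is from the thread, just a sketch): the second sequential step sees a gradient evaluated at the *updated* parameter, so the result differs from one step with the averaged gradient.

```python
# Per-batch loss L_i(w) = 0.5 * (w - t_i)^2, so grad_i(w) = w - t_i.
def grad(w, target):
    return w - target

w0, lr = 0.0, 0.5
t1, t2 = 1.0, 3.0   # each "batch" pulls w toward a different target

# Sequential: apply batch 1's update, then batch 2's at the new point.
w_seq = w0 - lr * grad(w0, t1)          # grad at w0
w_seq = w_seq - lr * grad(w_seq, t2)    # grad at the updated w

# Averaged: both gradients computed at w0, then averaged into one step.
g_avg = 0.5 * (grad(w0, t1) + grad(w0, t2))
w_avg = w0 - lr * g_avg

print(w_seq, w_avg)   # the two update rules land at different points
```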
eru|1 year ago
jey|1 year ago
bravura|1 year ago