deddy | 1 year ago
With gradient agreement filtering, having a greater number of microbatches generally increases the likelihood of finding another microbatch that agrees with a given gradient, simply by virtue of having more gradient "samples" to compare. So having more batches increases the chance of success there. The algorithm as laid out in the paper is a simple approach to combining groups of batches, where a larger number of batches doesn't necessarily improve your chances of success if the batch you're comparing against is itself an outlier. There are almost certainly better ways of combining greater numbers of batches to get a successful update. This is one of the exciting areas of future work.
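To make the filtering step concrete, here's a minimal sketch of the idea in numpy. The function name `gaf_update`, the threshold `tau`, and the "compare everything against the first microbatch" scheme are my own illustrative assumptions, not the paper's exact formulation — but the single-reference design is exactly what makes the approach fragile when that reference is an outlier, as described above:

```python
import numpy as np

def cosine_distance(g1, g2):
    # 1 - cosine similarity: 0 means perfect agreement, 2 means opposed
    return 1.0 - np.dot(g1, g2) / (np.linalg.norm(g1) * np.linalg.norm(g2))

def gaf_update(microbatch_grads, tau=0.97):
    """Illustrative GAF variant: take the first microbatch gradient as the
    reference, keep only gradients whose cosine distance to it is below tau,
    and average the survivors. Returns None if nothing agrees (skip the step).
    """
    reference = microbatch_grads[0]
    agreeing = [reference]
    for g in microbatch_grads[1:]:
        if cosine_distance(reference, g) < tau:
            agreeing.append(g)
    if len(agreeing) < 2:
        # no other microbatch agreed with the reference; reject the update
        return None
    return np.mean(agreeing, axis=0)
```

Note that if `microbatch_grads[0]` happens to be a noisy outlier, every comparison is made against a bad reference, which is the failure mode the comment points at; comparing all pairs (or iteratively refining the reference) would be one of the "better ways of combining" mentioned above.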
Increasing the batch size can generally be thought of as "averaging out" the noise in your samples to find a consistent update. This has an interesting effect, though: when using gradient agreement filtering, you want to lower the filter threshold as the batch size increases, becoming "stricter" about the level of agreement required to accept a gradient. This is shown in Figure 9 of the paper. It's also consistent with what other researchers have found: simply increasing batch size isn't always better, and there are diminishing returns to scaling it up (https://arxiv.org/abs/1812.06162). One interesting finding from the work was that smaller batch sizes actually improved training accuracy on CIFAR-100N. Very roughly speaking, this can be explained by there being more "signal" in each batch at smaller batch sizes, at the cost of potentially throwing out batches/gradients when they disagree.
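The "larger batches average out noise, so you can demand stricter agreement" intuition is easy to check with a toy simulation. This is just a sketch under my own assumptions (a fixed true gradient plus i.i.d. Gaussian per-sample noise), not anything from the paper: two independent large-batch gradient estimates land much closer together in cosine distance than two small-batch ones, so a tighter threshold is affordable.

```python
import numpy as np

rng = np.random.default_rng(0)
true_grad = np.ones(100)  # assumed "true" gradient direction for the toy model

def batch_grad(batch_size):
    # per-sample gradient = true gradient + heavy Gaussian noise;
    # the batch gradient is the mean over the batch
    noise = rng.normal(scale=5.0, size=(batch_size, 100))
    return (true_grad + noise).mean(axis=0)

def cosine_distance(a, b):
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# distance between two independent gradient estimates at each batch size
small_batch_dist = cosine_distance(batch_grad(4), batch_grad(4))
large_batch_dist = cosine_distance(batch_grad(256), batch_grad(256))
```

With the large batch, the noise averages down (its standard error shrinks like 1/sqrt(batch size)), so the two estimates agree far more closely and the filter threshold can be lowered without rejecting everything, matching the trend in Figure 9.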