igorkraw | 1 year ago

A few technical questions (I did somewhat related work with friends here https://openreview.net/forum?id=I3HCE7Ro78H, although we focused on gradient multiplicity in adversarial training, not massively parallel training):

1. Do you think this is a form of variance reduction or more a form of curriculum (focus first on the bulk, then on remaining errors)?

2. Did you observe any overfitting/additional adversarial risk?

3. Did you try this on just single-node minibatches as well? How did that perform?

deddy | 1 year ago

> 1. Do you think this is a form of variance reduction or more a form of curriculum (focus first on the bulk, then on remaining errors)?

I'd say it's generally more of a curriculum (using your terminology). Broadly speaking, the idea is to restrict stepping to "high-quality" directions where there is agreement/consistency in the direction of the update.
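A minimal sketch of that filtering rule (my own illustration, not the authors' code), using cosine similarity between two minibatch gradients as the agreement test:

```python
import numpy as np

def gaf_step(params, grad_a, grad_b, lr=0.1, cos_thresh=0.0):
    """Agreement-filtered SGD step (illustrative sketch).

    Steps with the averaged gradient only when the two independently
    computed minibatch gradients agree in direction (cosine similarity
    above `cos_thresh`); otherwise the step is a no-op.
    """
    cos = grad_a @ grad_b / (
        np.linalg.norm(grad_a) * np.linalg.norm(grad_b) + 1e-12
    )
    if cos <= cos_thresh:
        return params, False              # disagreement: skip this step
    return params - lr * 0.5 * (grad_a + grad_b), True
```

Two roughly aligned gradients produce a step; two orthogonal ones leave the parameters untouched.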

> 2. Did you observe any overfitting/additional adversarial risk?

No, actually one of the coolest findings of the work is that training with GAF prevents overfitting. It might slow down or stop training improvement, but it also prevents overfitting. Essentially, in late training, when overfitting would normally occur, the gradient directions become orthogonal; when that happens, GAF means you just don't take a step. Training ends up plateauing as it becomes harder to find two minibatches that agree, so you end up with more no-op epochs, but you don't overfit. I think we still have one training run going (after months) on CIFAR-100N-Fine that has yet to overfit. It's still slowly improving; last time we checked, train and val accuracy were both around ~60%.
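The orthogonality point is easy to check numerically: once minibatch gradients are dominated by independent noise, their cosine similarity concentrates near zero, so an agreement filter with a positive threshold rejects nearly every step:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 10_000  # parameter dimension

# Simulate late-training gradients as pure independent noise per minibatch.
g1 = rng.standard_normal(d)
g2 = rng.standard_normal(d)

# For independent Gaussian vectors, cosine similarity ~ N(0, 1/d):
# essentially orthogonal, so an agreement filter would make this a no-op.
cos = g1 @ g2 / (np.linalg.norm(g1) * np.linalg.norm(g2))
print(f"cosine similarity: {cos:.4f}")
```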

Adversarial risk is an interesting question, but this should help with that as well, provided that the adversarial examples are a minority of the training data and that the adversarial vulnerability comes from overfitting to / memorizing the adversarial part of those examples.

> 3. Did you try this on just single-node minibatches as well? How did that perform?

The number of nodes is more of a performance/implementation detail, i.e. to what extent/scale you parallelize. For the technique to work you just need 2+ macrobatches that you can compare to decide whether to take your step. CIFAR-100N is small enough that you can run multiple minibatches on a single GPU (node) and it all fits into VRAM. Even if it didn't fit into VRAM, you could theoretically save the gradient off to disk before taking a step and the technique would still apply/help/work; it would just be slower.

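To illustrate the single-node point: nothing requires the two macrobatch gradients to be computed in parallel. Computing them sequentially on one device, much like gradient accumulation, and then applying the same agreement test gives identical behavior, just slower. A hypothetical sketch, where `loss_grad` stands in for any function returning the gradient for one batch:

```python
import numpy as np

def single_device_gaf_step(loss_grad, params, batch_a, batch_b,
                           lr=0.1, cos_thresh=0.0):
    """Sequential (single-node) variant of an agreement-filtered step.

    The two macrobatch gradients are computed one after the other on the
    same device, then compared before deciding whether to step.
    """
    g1 = loss_grad(params, batch_a)   # first pass
    g2 = loss_grad(params, batch_b)   # second pass, same device
    cos = g1 @ g2 / (np.linalg.norm(g1) * np.linalg.norm(g2) + 1e-12)
    if cos <= cos_thresh:
        return params                 # gradients disagree: no-op step
    return params - lr * 0.5 * (g1 + g2)
```

For example, with a simple quadratic loss whose gradient is `params - target`, two batches with nearby targets agree and the parameters move toward them, while two batches pulling in opposite directions produce a no-op.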