item 20506165

Vizarddesky | 6 years ago

Thanks for taking the effort to explain this; more people should be aware of this interesting effect. I agree with everything you said except for the conclusion. BN without L2 is not underfitting: in my experience it is overfitting, due to the small effective learning rate. This is easy to verify, just compare the gap between training and test losses/errors. So my conclusion from the derivation is that the L2 penalty with BN acts as a regularizer in a different way, by increasing the effective learning rate of the weights. On a related note, this effect can be present even without normalization: by adding just a scalar multiplier parameter in the branch, the weight's scale can be more or less decoupled from its direction. For reference, some shameless self-promotion about our recent work on training residual networks without normalization: http://arxiv.org/abs/1901.09321
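The "effective learning rate" argument rests on a scale-invariance property: BN makes the output invariant to the scale of the incoming weights, so the gradient with respect to those weights shrinks as 1/||w||, and growing weights (which L2 counteracts) effectively slow learning. A minimal numpy sketch of that property, with made-up data and a made-up squared-error loss purely for illustration:

```python
import numpy as np

def batchnorm(z):
    # Normalize pre-activations over the batch (no learned scale/shift).
    return (z - z.mean()) / z.std()

def loss(w, X, t):
    # Toy squared-error loss on the batch-normalized pre-activation.
    return ((batchnorm(X @ w) - t) ** 2).mean()

def num_grad(f, w, h=1e-5):
    # Central-difference numerical gradient of f at w.
    g = np.zeros_like(w)
    for i in range(len(w)):
        e = np.zeros_like(w)
        e[i] = h
        g[i] = (f(w + e) - f(w - e)) / (2 * h)
    return g

rng = np.random.default_rng(0)
X = rng.normal(size=(32, 4))   # batch of 32 inputs
t = rng.normal(size=32)        # arbitrary targets
w = rng.normal(size=4)

f = lambda w_: loss(w_, X, t)
g1 = num_grad(f, w)
g2 = num_grad(f, 2.0 * w)

# BN makes the loss invariant to the scale of w...
assert np.isclose(f(w), f(2.0 * w))
# ...so doubling ||w|| halves the gradient: the effective step on the
# weight direction scales like lr / ||w||^2.
assert np.allclose(g2, g1 / 2.0, atol=1e-4)
```

This is why shrinking the weight norm via L2 raises the effective learning rate, and why a plain scalar multiplier on a branch can reproduce the effect without any normalization.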
