item 20506165

Vizarddesky | 6 years ago

Thanks for taking the effort to explain this; more people should be aware of this interesting effect. I agree with everything you said except for the conclusion. BN without L2 is not underfitting: in my experience it is overfitting, due to the small effective learning rate. This is easy to verify, just compare the gap between training and test losses/errors. So my conclusion from the derivation is that the L2 penalty with BN acts as a regularizer in a different way, by increasing the effective learning rate of the weights. On a related note, this effect can be present even without normalization: by adding just a scalar multiplier parameter in the branch, the weight's scale can be more or less decoupled from its direction. For reference, some shameless self-promotion about our recent work on training residual networks without normalization: http://arxiv.org/abs/1901.09321
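The "effective learning rate" argument rests on a scale-invariance property: BN makes the output invariant to the scale of the incoming weights, so the gradient with respect to those weights shrinks as 1/||w||, and growing weights (which L2 counteracts) effectively slow learning. A minimal numpy sketch of that property, with made-up data and a made-up squared-error loss purely for illustration:

```python
import numpy as np

def batchnorm(z):
    # Normalize pre-activations over the batch (no learned scale/shift).
    return (z - z.mean()) / z.std()

def loss(w, X, t):
    # Toy squared-error loss on the batch-normalized pre-activation.
    return ((batchnorm(X @ w) - t) ** 2).mean()

def num_grad(f, w, h=1e-5):
    # Central-difference numerical gradient of f at w.
    g = np.zeros_like(w)
    for i in range(len(w)):
        e = np.zeros_like(w)
        e[i] = h
        g[i] = (f(w + e) - f(w - e)) / (2 * h)
    return g

rng = np.random.default_rng(0)
X = rng.normal(size=(32, 4))   # batch of 32 inputs
t = rng.normal(size=32)        # arbitrary targets
w = rng.normal(size=4)

f = lambda w_: loss(w_, X, t)
g1 = num_grad(f, w)
g2 = num_grad(f, 2.0 * w)

# BN makes the loss invariant to the scale of w...
assert np.isclose(f(w), f(2.0 * w))
# ...so doubling ||w|| halves the gradient: the effective step on the
# weight direction scales like lr / ||w||^2.
assert np.allclose(g2, g1 / 2.0, atol=1e-4)
```

This is why shrinking the weight norm via L2 raises the effective learning rate, and why a plain scalar multiplier on a branch can reproduce the effect without any normalization.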
