criticaltinker | 3 years ago
I am surprised and a bit disappointed this paper does not mention mean field theory or dynamical isometry at all.
Mean field theory applies methods from physics, namely random matrix theory and free probability, to derive exact analytical descriptions of how signals propagate through a neural network.
It turns out that simply initializing the weights of a plain CNN using a delta-orthogonal kernel allows all frequency components (Fourier modes) to propagate through the network with minimal attenuation. Specifically, networks train well when their input-output Jacobians exhibit dynamical isometry, namely the property that the entire distribution of singular values is close to 1. This technique effectively solves the exploding/vanishing gradient problem.
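A rough sketch of what a delta-orthogonal kernel looks like (NumPy; the function name and array layout here are my own, and real implementations handle more cases than this): the kernel is zero everywhere except the spatial center, which holds a random orthogonal map between the channel dimensions.

```python
import numpy as np

def delta_orthogonal(k, c_in, c_out, seed=None):
    """Sketch of a delta-orthogonal conv kernel: zeros everywhere
    except the spatial center, which holds a random orthogonal map.
    Requires c_out >= c_in so the center matrix can have orthonormal rows."""
    assert c_out >= c_in, "delta-orthogonal init needs c_out >= c_in"
    rng = np.random.default_rng(seed)
    # Random orthogonal matrix via QR decomposition of a Gaussian matrix
    a = rng.standard_normal((c_out, c_out))
    q, r = np.linalg.qr(a)
    q = q * np.sign(np.diag(r))  # sign fix so the distribution is uniform (Haar)
    w = np.zeros((k, k, c_in, c_out))
    # Center tap gets orthonormal rows: H @ H.T == I, so norms are preserved
    w[k // 2, k // 2] = q[:, :c_in].T
    return w
```

Because only the center tap is nonzero, each frequency component passes through the layer with its norm preserved, which is what keeps the Jacobian's singular values near 1.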
The impact is striking: the time to train a network to a given accuracy becomes independent of network depth. No tricks like batch normalization, dropout, or anything else are needed. This result has been demonstrated for a wide range of architectures, from plain FFNs to CNNs, RNNs, and even transformers.
I highly recommend reading the papers "Dynamical Isometry and a Mean Field Theory of CNNs: How to Train 10,000-Layer Vanilla Convolutional Neural Networks" [1] and "ReZero is All You Need: Fast Convergence at Large Depth" [2].
[1] https://arxiv.org/abs/1806.05393
[2] https://proceedings.mlr.press/v161/bachlechner21a/bachlechne...
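The core trick in [2] is simple enough to sketch in a few lines (NumPy; `rezero_block` and the tanh branch are illustrative, not the paper's code): each residual branch is scaled by a learned scalar initialized to zero, so every block starts out as the identity map and the network trivially satisfies dynamical isometry at initialization.

```python
import numpy as np

def rezero_block(x, f, alpha):
    """ReZero-style residual block: x + alpha * F(x).
    With alpha initialized to 0, the block is exactly the identity,
    so signals propagate undistorted regardless of depth."""
    return x + alpha * f(x)

x = np.array([1.0, 2.0, 3.0])
# At initialization alpha = 0, so the output equals the input exactly
out = rezero_block(x, np.tanh, alpha=0.0)
```

During training, each alpha is a learnable parameter that grows away from zero, gradually "switching on" its residual branch.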