criticaltinker | 3 years ago
I am surprised and a bit disappointed this paper does not mention mean field theory or dynamical isometry at all.
Mean field theory applies methods from physics, namely random matrix theory and free probability, to derive exact analytical descriptions of how signals propagate through a neural network.
It turns out that simply initializing the weights of a plain CNN using a delta-orthogonal kernel allows all frequency components (Fourier modes) to propagate through the network with minimal attenuation. Specifically, networks train well when their input-output Jacobians exhibit dynamical isometry, namely the property that the entire distribution of singular values is close to 1. This technique effectively solves the exploding/vanishing gradient problem.
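A rough sketch of what a delta-orthogonal kernel looks like (NumPy; the function name and array layout here are my own, and real implementations handle more cases than this): the kernel is zero everywhere except the spatial center, which holds a random orthogonal map between the channel dimensions.

```python
import numpy as np

def delta_orthogonal(k, c_in, c_out, seed=None):
    """Sketch of a delta-orthogonal conv kernel: zeros everywhere
    except the spatial center, which holds a random orthogonal map.
    Requires c_out >= c_in so the center matrix can have orthonormal rows."""
    assert c_out >= c_in, "delta-orthogonal init needs c_out >= c_in"
    rng = np.random.default_rng(seed)
    # Random orthogonal matrix via QR decomposition of a Gaussian matrix
    a = rng.standard_normal((c_out, c_out))
    q, r = np.linalg.qr(a)
    q = q * np.sign(np.diag(r))  # sign fix so the distribution is uniform (Haar)
    w = np.zeros((k, k, c_in, c_out))
    # Center tap gets orthonormal rows: H @ H.T == I, so norms are preserved
    w[k // 2, k // 2] = q[:, :c_in].T
    return w
```

Because only the center tap is nonzero, each frequency component passes through the layer with its norm preserved, which is what keeps the Jacobian's singular values near 1.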
The impact is striking: the time to train a network to a given accuracy becomes independent of network depth. No tricks like batch normalization, dropout, or anything else are needed. This result has been demonstrated for a wide range of architectures, from plain FFNs to CNNs, RNNs, and even transformers.
I highly recommend reading the papers "Dynamical Isometry and a Mean Field Theory of CNNs: How to Train 10,000-Layer Vanilla Convolutional Neural Networks" [1] and "ReZero is All You Need: Fast Convergence at Large Depth" [2].
[1] https://arxiv.org/abs/1806.05393
[2] https://proceedings.mlr.press/v161/bachlechner21a/bachlechne...
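The core trick in [2] is simple enough to sketch in a few lines (NumPy; `rezero_block` and the tanh branch are illustrative, not the paper's code): each residual branch is scaled by a learned scalar initialized to zero, so every block starts out as the identity map and the network trivially satisfies dynamical isometry at initialization.

```python
import numpy as np

def rezero_block(x, f, alpha):
    """ReZero-style residual block: x + alpha * F(x).
    With alpha initialized to 0, the block is exactly the identity,
    so signals propagate undistorted regardless of depth."""
    return x + alpha * f(x)

x = np.array([1.0, 2.0, 3.0])
# At initialization alpha = 0, so the output equals the input exactly
out = rezero_block(x, np.tanh, alpha=0.0)
```

During training, each alpha is a learnable parameter that grows away from zero, gradually "switching on" its residual branch.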