top | item 37079132


SimplyUnknown | 2 years ago

First of all, great blog post with great examples. Reminds me of what distill.pub used to be.

Second, the article correctly states that L2 weight decay is typically used, leading to a lot of weights with small magnitudes. To get models that generalize better, would it then be better to always use L1 weight decay to promote sparsity, in combination with longer training?
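The difference is easy to demonstrate on a toy problem (a sketch, not the article's setup): L2 decay only shrinks weights toward zero, while an L1 proximal step (soft-thresholding, as in ISTA) drives small weights to exactly zero.

```python
import numpy as np

# Hypothetical target weights the unregularized loss would recover.
target = np.array([2.0, 0.3, -0.1, 0.05])

def train(penalty, lam=0.2, lr=0.1, steps=500):
    """Minimize 0.5*||w - target||^2 + lam * penalty(w) by gradient descent."""
    w = np.zeros_like(target)
    for _ in range(steps):
        grad = w - target  # gradient of the smooth part of the loss
        if penalty == "l2":
            # L2 weight decay: extra lam*w term shrinks weights multiplicatively.
            w = w - lr * (grad + lam * w)
        else:
            # L1 via a proximal (soft-thresholding) step: small weights snap
            # to exactly zero instead of merely shrinking.
            w = w - lr * grad
            w = np.sign(w) * np.maximum(np.abs(w) - lr * lam, 0.0)
    return w

w_l2 = train("l2")
w_l1 = train("l1")
print("L2:", np.round(w_l2, 3))  # all entries small but nonzero
print("L1:", np.round(w_l1, 3))  # two entries are exactly zero
```

With L2 the solution is `target / (1 + lam)`, so every weight stays nonzero; with L1 the solution is the soft-thresholded target, so the two entries smaller than `lam` land at exactly zero.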

I wonder whether deep learning models that only use sparse fourier features rather than dense linear layers would work better...


medium_spicy | 2 years ago

Short answer: if the inputs can be represented well on the Fourier basis, yes. I have a patent in process on this, fingers crossed.

Longer answer: deep learning models are usually trying to find the best nonlinear basis in which to represent inputs; if the inputs are well-represented (read: can be sparsely represented) in some basis known a priori, it usually helps to just put them in that basis, e.g., by FFT'ing RF signals.
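The "sparsely represented" point can be made concrete with a toy example (my own illustration, not the commenter's data): a signal built from a few sinusoids is dense in the time domain but nearly all of its energy lands in a handful of Fourier coefficients.

```python
import numpy as np

n = 1024
t = np.arange(n) / n

# A toy "RF" signal: three sinusoids at integer frequencies, so each one
# falls exactly on a single FFT bin. Dense in time, sparse in frequency.
x = (np.sin(2 * np.pi * 50 * t)
     + 0.5 * np.sin(2 * np.pi * 120 * t)
     + 0.25 * np.sin(2 * np.pi * 300 * t))

spectrum = np.fft.rfft(x)
energy = np.abs(spectrum) ** 2

# Fraction of total energy captured by just the 3 largest of 513 coefficients.
top3 = np.sort(energy)[-3:].sum() / energy.sum()
print(f"fraction of energy in top 3 coefficients: {top3:.6f}")
```

A linear layer fed `spectrum` only needs to attend to three inputs; fed `x`, it has to learn the Fourier transform itself before it can exploit that structure.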

The challenge is that the overall-optimal basis might not be the same as those of any local minima, so you’ve got to do some tricks to nudge the network closer.

qumpis | 2 years ago

Slightly related, but the sparsity-inducing activation function ReLU is often used in neural networks.
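This sparsity is easy to see numerically (a minimal sketch, assuming roughly zero-mean pre-activations): ReLU clamps every negative input to exactly zero, so about half of the activations vanish.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(10_000)   # pre-activations, roughly zero-mean
h = np.maximum(x, 0.0)            # ReLU: max(x, 0)

# For zero-mean inputs, roughly half the activations are exactly zero.
sparsity = np.mean(h == 0.0)
print(f"fraction of exactly-zero activations: {sparsity:.3f}")
```

Note this is sparsity in the *activations* rather than in the *weights*, which is what L1 weight decay targets; the two are complementary.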