top | item 22509633

Proving the Lottery Ticket Hypothesis: Pruning is All You Need

174 points | che_shr_cat | 6 years ago | arxiv.org | reply

61 comments

[+] stared|6 years ago|reply
The lottery ticket hypothesis is IMHO the single most interesting finding for deep learning. It explains why deep learning works (vs. shallow neural nets), why initial over-parametrization is often useful, why deeper is often better than shallower, etc.

I recommend for an overview:

- the original paper "The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks", https://arxiv.org/abs/1803.03635

- "Deconstructing Lottery Tickets: Zeros, Signs, and the Supermask" https://eng.uber.com/deconstructing-lottery-tickets/ showing that if we remove "non-winning tickets" before the training, the trained network still works well
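For anyone who wants the mechanics rather than the papers: the find-a-ticket procedure from the original paper is roughly "train, prune the smallest weights, rewind the survivors to their initial values, repeat". Here's a toy numpy sketch on a linear model (the data, prune rate, and step counts are all made up for illustration, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression task: y = X @ w_true with a sparse ground truth.
X = rng.normal(size=(200, 20))
w_true = np.zeros(20)
w_true[:5] = rng.uniform(1.0, 2.0, size=5)   # 5 "important" weights
y = X @ w_true

def train(mask, w_init, steps=500, lr=0.05):
    """Gradient descent on the masked weights; pruned weights stay at zero."""
    w = w_init * mask
    for _ in range(steps):
        grad = X.T @ (X @ w - y) / len(y)
        w = (w - lr * grad) * mask
    return w

w_init = rng.normal(size=20)   # the candidate "winning ticket" initialization
mask = np.ones(20)
for _ in range(3):             # a few prune/rewind rounds
    w = train(mask, w_init)
    # Prune the 30% smallest-magnitude surviving weights...
    alive = np.flatnonzero(mask)
    k = int(0.3 * len(alive))
    mask[alive[np.argsort(np.abs(w[alive]))[:k]]] = 0.0
    # ...and the next train() call rewinds the survivors to w_init.

w_final = train(mask, w_init)
print("weights remaining:", int(mask.sum()), "of 20")
print("final loss:", float(np.mean((X @ w_final - y) ** 2)))
```

The papers' point is that the mask found this way, paired with the original initialization w_init, trains back to full accuracy, while the same mask with a fresh random init typically does not.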

[+] jl2718|6 years ago|reply
This is really not a new idea for anybody that has studied numerical optimization or done a lot of much simpler linear modeling. It has to do with the gradients, which are basically random steps in high-dimensional spaces unless you align with a ‘large’ eigenvector of the Hessian. This explanation didn’t go over well at all with the ML cargo cultists until someone gave it a name. Interesting how things work.
[+] vecter|6 years ago|reply
Can you summarize and explain those things (why deep learning works, why over-parameterization is useful, and why deeper is better than shallower) assuming an undergraduate-level understanding of mathematics and computer science?
[+] xt00|6 years ago|reply
If “pruning is all you need” that does feel like a way of explaining how intelligence could come out of a mass of neurons such as our brain. Or at least that sounds like a thing that makes it understandable to me. Basically add a bunch of connections relatively randomly, start pruning slowly until you hit a point where the system changes... I’ll keep hand waving until somebody who knows this stuff can chime in.. :)
[+] Buttons840|6 years ago|reply
In psychology class the professor told me that the number of connections in a human brain increases only twice in life, during infancy and during adolescence; at all other times the number of connections is decreasing.
[+] perl4ever|6 years ago|reply
How do you sculpt an elephant from a block of stone? Just chip away everything that doesn't look like an elephant...
[+] IX-103|6 years ago|reply
This is really neat and has a lot of implications for porting larger models to limited platforms like mobile. Unfortunately you still have to train the larger network, so the gains are somewhat limited. Some other papers I read show that you might be able to prune the network in the middle of training, which would make larger models more practical to work with.
[+] ekelsen|6 years ago|reply
The implications are unclear to me. We already know how to prune models for inference. For example https://arxiv.org/abs/1710.01878, along with earlier work (and more recent work). There's also work showing that you can take advantage of the sparsity to achieve practical speed gains: https://arxiv.org/abs/1911.09723.

We can also train networks that are sparse from the beginning of training (without requiring any special knowledge of the solution): https://arxiv.org/abs/1911.11134. It remains to be shown that this can be done with a speed advantage.

[+] rubyn00bie|6 years ago|reply
Am I understanding this right? Surely, I must be missing the entire point because...

This looks to me like: adding more and more bullshit to a model, while managing to increase its accuracy, eventually leads to a "smaller" model with less bullshit?

That is to say: adding correlated or endogenous variables to a model (over-parameterization), so long as it increases its accuracy, will eventually yield a smaller, more optimized model with fewer variables?

If so, why is this news? Isn't this like the fundamental process of most statistics and optimization problems? And isn't adding more data (when available) a fundamental method of dealing with multicollinearity?

[+] nil-sec|6 years ago|reply
I think you do misunderstand. They do not add “correlated variables” to a model. The idea is that if you have an overparameterized model for a specific problem, this model contains a smaller model that has similar performance to the trained large model, without training! That means gradient descent is in fact equivalent to pruning weights in a random network. There is no efficient algorithm for doing this (as they show), but that does not mean there aren't as-yet-unknown heuristics out there that would get you close. This is exciting, as it means a potential alternative to backprop is out there. That would be cool because it might mean more efficient algorithms and, something I haven't seen mentioned in the paper, an alternative to backprop that might be easier to understand in a biologically plausible way.
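To make "learning by pruning alone" concrete, here's a toy sketch of nil-sec's point (my own greedy-search illustration, not the paper's construction or any efficient algorithm): fix a random two-layer net, never touch its weights, and only decide which hidden units to keep.

```python
import numpy as np

rng = np.random.default_rng(1)
relu = lambda z: np.maximum(z, 0.0)

# Target function on [-1, 1] and a fixed, never-trained random 2-layer net.
xs = np.linspace(-1, 1, 100).reshape(-1, 1)
target = np.abs(xs).ravel()                 # f(x) = |x|

W1 = rng.normal(size=(1, 64))
W2 = rng.normal(size=(64, 1))

def mse(mask):
    """Error of the subnetwork obtained by zeroing hidden units where mask=0."""
    out = (relu(xs @ W1) * mask) @ W2
    return float(np.mean((out.ravel() - target) ** 2))

best = np.ones(64)
best_err = mse(best)
improved = True
while improved:        # greedy pruning: no gradients, no weight updates
    improved = False
    for j in np.flatnonzero(best):
        trial = best.copy()
        trial[j] = 0.0
        if mse(trial) < best_err:
            best, best_err, improved = trial, mse(trial), True

print(f"error: full {mse(np.ones(64)):.3f} -> pruned {best_err:.3f}, "
      f"{int(best.sum())}/64 units kept")
```

Greedy deletion is exponentially far from exhaustive mask search, which is exactly why the "no efficient algorithm" caveat matters; this only shows that the error can't get worse as the greedy search prunes.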
[+] bo1024|6 years ago|reply
I have a question. They show that any given depth-ell network, computing F, is w.h.p. approximated by some subnetwork of a random depth-2ell network.

But there is a theorem that even depth-2 networks can approximate any continuous function F. If the assumptions were the same, then their theorem would imply any continuous function F is w.h.p. approximated by some subnetwork of a depth-4 network.

So what is the difference in assumptions, i.e. what’s the significance of F being computed by a depth-ell network? What functions can a depth-ell+1 network approximate that a depth-ell network can’t? I’d guess it has to do with Lipschitz assumptions and bounded parameters but would be awesome if someone can clarify!

[+] orange3xchicken|6 years ago|reply
The theorem you mention is true for networks whose widths tend towards infinity.

This paper assumes a network is given with fixed width n and fixed depth l. The main result is that there exists a subnetwork of a random network with depth 2l and width polynomial in n and l that can approximate it arbitrarily well.
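For what it's worth, here's how I read the contrast (my paraphrase, with epsilon the approximation error and delta the failure probability; the exact polynomial and the input domain are in the paper):

```latex
% Universal approximation: width tends to infinity for a *fixed* depth 2.
% This paper: a target F of fixed width n and depth \ell, and a *random*
% network G of depth 2\ell and width \mathrm{poly}(n, \ell, 1/\varepsilon),
% which is only pruned, never trained:
\[
  \Pr_{G}\Big[\, \exists\, \tilde{G} \subseteq G :\;
     \sup_{\lVert x \rVert \le 1} \big| F(x) - \tilde{G}(x) \big| \le \varepsilon
  \Big] \;\ge\; 1 - \delta,
\]
% where \tilde{G} ranges over subnetworks of G (some weights zeroed out).
```

So the bounded width is the difference: the depth-2 universal approximation theorem buys generality by letting the width blow up, while here the width stays polynomial because F is promised to already be computable by a small depth-l network.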

[+] anonymousDan|6 years ago|reply
As a potentially naive thought experiment: if you just generated in advance a number of random networks of similar size to the pruned lottery ticket, and then trained them all in parallel, would you eventually find a lottery ticket? If so, how many would you have to train to find a lottery ticket with high probability? Why is training one big network and then pruning any better than training lots of different smaller networks? Assume in all of the above that you have a rough idea of how big the pruned network will be.
[+] aerodude|6 years ago|reply
The number of subgraphs increases exponentially with the number of additional layers (and neurons). If you started off with a network the size of the final pruned network, you would have a dramatically lower chance of finding a winning ticket compared to the oversized network you start with.
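Back-of-envelope numbers for this (layer sizes invented for illustration): each weight can be kept or dropped, so a fully connected net with W weights has 2^W candidate subnetworks.

```python
def n_weights(layers):
    """Weight count of a fully connected net with the given layer sizes."""
    return sum(a * b for a, b in zip(layers, layers[1:]))

small = [10, 10, 10]         # a net already the size of the pruned ticket
big = [10, 100, 100, 10]     # an over-parameterized starting net

for name, layers in [("small", small), ("big", big)]:
    w = n_weights(layers)
    # log10(2^w) = w * log10(2) ~ w * 0.30103
    print(f"{name}: {w} weights, 2^{w} (~10^{int(w * 0.30103)}) subnetworks")
```

So the big net holds roughly 10^3612 candidate masks versus roughly 10^60 for the small one: vastly more tickets in the draw.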
[+] ludamad|6 years ago|reply
I would be potentially naively curious as well
[+] tells|6 years ago|reply
ELI5 someone please.
[+] m0zg|6 years ago|reply
So in other words, a sufficiently large set of monkeys with typewriters contains a subset which approximates the works of Shakespeare.
[+] lonelappde|6 years ago|reply
This paper formally proves what everyone already intuitively knows, right?

It's mathematically interesting, but not a practical advance.

[+] zackmorris|6 years ago|reply
I've always felt that there is a deep connection between evolution and thought, or more specifically, genetic algorithms (GAs) and neural networks (NNs).

The state of the art when I started following AI in the late 90s was random weights and hyper-parameters chosen with a GA, then optimized with NN hill climbing to find the local maximum. Looks like the research has continued:

https://www.google.com/search?q=genetic+algorithm+neural+net...

All I'm saying is that since we're no longer compute-bound, I'd like to see more big-picture thinking. We're so obsessed with getting 99% accuracy on some pattern-matching test that we completely miss other options, like in this case that effective subnetworks can evolve within a larger system of networks.

I'd like to see a mathematical proof showing that these and all other approaches to AI like simulated annealing are (or can be made) equivalent. Sort of like a Church–Turing thesis for machine learning:

https://en.wikipedia.org/wiki/Church–Turing_thesis

If we had this, then we could use higher-level abstractions and substitute simpler algorithms (like GAs) for the harder ones (like NNs) and not get so lost in the minutiae and terminology. Once we had working solutions, we could analyze them and work backwards to convert them to their optimized/complex NN equivalents.

An analogy for this would be solving problems in our heads with simpler/abstract methods like spreadsheets, functional programming and higher-order functions. Then translating those solutions to whatever limited/verbose imperative languages we have to use for our jobs.

Edit: I should have said "NN gradient descent to find the local minimum" but hopefully my point still stands.

Edit 2: I should clarify that in layman's terms, Church-Turing says "every effectively calculable function is a computable function" so functional programming and imperative programming can solve the same problems, be used interchangeably and even be converted from one form to the other.

[+] sgillen|6 years ago|reply
> All I'm saying is that since we're no longer compute-bound, I'd like to see more big-picture thinking. We're so obsessed with getting 99% accuracy on some pattern-matching test that we completely miss other options.

There are so many people working in this field now, you can be sure a lot of them are doing big picture thinking.

> I'd like to see a mathematical proof showing that these and all other approaches to AI like simulated annealing are (or can be made) equivalent. Sort of like a Church–Turing thesis for machine learning:

Maybe I’m misunderstanding what you are saying, but I think the different optimization techniques/metaheuristics you’re talking about do actually have provably different properties.

[+] p1esk|6 years ago|reply
What do you mean we're no longer compute bound? Why don't you try replicating Microsoft's Turing-NLG 17B parameter model: "... trains with only 256 NVIDIA GPUs compared to 1024 NVIDIA GPUs needed by using Megatron-LM alone."