The lottery ticket hypothesis is IMHO the single most interesting finding in deep learning. It explains why deep learning works (vs shallow neural nets), why initial over-parametrization is often useful, why deeper is often better than shallower, etc.
I recommend for an overview:
- the original paper "The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks", https://arxiv.org/abs/1803.03635
- "Deconstructing Lottery Tickets: Zeros, Signs, and the Supermask" https://eng.uber.com/deconstructing-lottery-tickets/ showing that if we remove the "non-winning tickets" before training, the trained network still works well
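For anyone who wants to see the procedure rather than read the paper: a toy sketch of iterative magnitude pruning with weight rewinding (the algorithm from the original paper), on a made-up linear regression "network" rather than the paper's actual experiments:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy task: the "network" is a dense linear model y = X @ w, and only
# 5 of its 50 weights are actually needed. Task and sizes are made up.
X = rng.normal(size=(200, 50))
w_true = np.zeros(50)
w_true[:5] = [1.5, -2.0, 0.8, 1.2, -1.0]
y = X @ w_true

def train(w, mask, steps=500, lr=0.3):
    """Gradient descent on MSE; pruned weights are frozen at zero."""
    for _ in range(steps):
        grad = X.T @ (X @ (w * mask) - y) / len(y)
        w = w - lr * grad * mask
    return w

w_init = rng.normal(size=50) * 0.1   # the "lottery" initialization
mask = np.ones(50)

# Iterative magnitude pruning: train, prune the smallest half of the
# surviving weights, rewind survivors to their ORIGINAL init, repeat.
for _ in range(3):
    w = train(w_init.copy(), mask)
    alive = np.flatnonzero(mask)
    k = len(alive) // 2
    mask[alive[np.argsort(np.abs(w[alive]))[:k]]] = 0.0

w_final = train(w_init.copy(), mask)   # retrain the "winning ticket"
loss = np.mean((X @ (w_final * mask) - y) ** 2)
print(f"kept {int(mask.sum())}/50 weights, retrained loss {loss:.2e}")
```

The key step is the rewind: survivors go back to their original initial values, not to their trained values, and the sparse subnetwork still trains to a good solution.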
This is really not a new idea for anyone who has studied numerical optimization or done a lot of much simpler linear modeling. It has to do with the gradients, which are basically random steps in high-dimensional spaces unless you align with a ‘large’ eigenvector of the Hessian. This explanation didn’t go over well at all with the ML cargo cultists until someone gave it a name. Interesting how things work.
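The Hessian-eigenvector point is easy to see numerically on a toy quadratic (an illustration of the claim, not of any particular network):

```python
import numpy as np

# Quadratic loss f(w) = 0.5 * w.T @ H @ w. Gradient descent contracts
# the error along each Hessian eigendirection by (1 - lr * eigenvalue)
# per step: fast along the large eigenvector, glacial along the small.
H = np.diag([10.0, 0.01])      # eigenvalues 10 and 0.01
w = np.array([1.0, 1.0])       # equally far along both eigendirections
lr = 0.05
for _ in range(100):
    w = w - lr * (H @ w)

print(w)   # first coordinate essentially 0, second barely moved
```

After 100 steps the large-eigenvalue coordinate has collapsed to ~0 while the small-eigenvalue coordinate has moved less than 5% of the way.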
I agree. I found Jonathan Frankle’s interview on the NLP Highlights podcast to be a great explanation of the findings; it provided some really good intuitions for how deep learning behaves.
https://podcasts.apple.com/us/podcast/101-the-lottery-ticket...
Can you summarize and explain those things (why deep learning works, why over-parameterization is useful, and why deeper is better than shallower) assuming an undergraduate-level understanding of mathematics and computer science?
If “pruning is all you need”, that does feel like a way of explaining how intelligence could come out of a mass of neurons such as our brain. Or at least it sounds like a thing that makes it understandable to me. Basically, add a bunch of connections relatively randomly, then start pruning slowly until you hit a point where the system changes... I’ll keep hand-waving until somebody who knows this stuff can chime in... :)
In psychology class the professor told me that the number of connections in a human brain increases only twice in life, during infancy and during adolescence; at all other times the number of connections is decreasing.
This is really neat and has a lot of implications for porting larger models to limited platforms like mobile. Unfortunately you still have to train the larger network, so the gains are somewhat limited. Some other papers I read show that you might be able to prune the network in the middle of training, which would make larger models more practical to work with.
The implications are unclear to me. We already know how to prune models for inference. For example https://arxiv.org/abs/1710.01878, along with earlier work (and more recent work). There's also work showing that you can take advantage of the sparsity to achieve practical speed gains: https://arxiv.org/abs/1911.09723.
We can also train networks that are sparse from the beginning of training (without requiring any special knowledge of the solution): https://arxiv.org/abs/1911.11134. It remains to be shown that this can be done with a speed advantage.
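To make the "practical speed gains from sparsity" point concrete, here's a toy sketch (plain NumPy coordinate format, not the blocked kernels of the linked paper) showing that a 90%-pruned layer needs roughly 10% of the multiply-adds:

```python
import numpy as np

rng = np.random.default_rng(0)

# Magnitude-prune 90% of a layer, store only the surviving weights in
# a simple coordinate format, and count multiply-adds. Real sparse
# kernels (CSR, block sparsity) are needed for wall-clock gains, but
# the arithmetic saving is just the surviving weight count.
W = rng.normal(size=(512, 512))
thresh = np.quantile(np.abs(W), 0.9)
W[np.abs(W) < thresh] = 0.0                # drop the smallest 90%

rows, cols = np.nonzero(W)
vals = W[rows, cols]
x = rng.normal(size=512)

dense_y = W @ x                            # 512*512 = 262144 MACs
sparse_y = np.zeros(512)
np.add.at(sparse_y, rows, vals * x[cols])  # one MAC per stored weight

print(f"dense MACs: {512 * 512}, sparse MACs: {len(vals)}")
```

Whether the 10x fewer MACs turn into 10x wall-clock speed depends entirely on the kernel and the sparsity pattern, which is what the linked papers are about.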
Am I understanding this right? Surely, I must be missing the entire point because...
This looks to me like: adding more and more bullshit to a model, while managing to increase its accuracy, eventually leads to a "smaller" model with less bullshit?
That is to say, adding correlated or endogenous variables to a model (over-parameterization), so long as it increases its accuracy, will one day yield a smaller, more optimized model with fewer variables?
If so, why is this news? Isn't this the fundamental process of most statistics and optimization problems? Or isn't adding more data (when available) a fundamental method of dealing with multicollinearity?
I think you do misunderstand. They do not add “correlated variables” to a model. The idea is that if you have an overparameterised model for a specific problem, this model contains a smaller model that has similar performance to the trained large model, without training! That means gradient descent is in fact equivalent to pruning weights in a random network. There is no efficient algorithm for doing this (as they show), but that does not mean there are no (so far unknown) heuristics out there that would get you close. This is exciting because it means a potential alternative to backprop is out there. That would be cool because it might mean more efficient algorithms and, something I haven’t seen mentioned in the paper, an alternative to backprop that might be easier to understand in a biologically plausible way.
I have a question. They show that any given depth-ℓ network, computing F, is w.h.p. approximated by some subnetwork of a random depth-2ℓ network.
But there is a theorem that even depth-2 networks can approximate any continuous function F. If the assumptions were the same, then their theorem would imply any continuous function F is w.h.p. approximated by some subnetwork of a depth-4 network.
So what is the difference in assumptions, i.e. what’s the significance of F being computed by a depth-ℓ network? What functions can a depth-(ℓ+1) network approximate that a depth-ℓ network can’t? I’d guess it has to do with Lipschitz assumptions and bounded parameters, but it would be awesome if someone could clarify!
The theorem you mention is true for networks whose widths tend towards infinity.
This paper assumes a neural network is given with fixed width n and fixed depth ℓ. The main result is that there exists a subnetwork of a random network of depth 2ℓ, with width polynomial in n and ℓ, that approximates it arbitrarily well.
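The proof technique behind that result is essentially a random subset-sum argument: a pruned two-layer linear piece of the random network has effective coefficient equal to the sum of the surviving weight products, and with enough random paths some subset lands within epsilon of any bounded target. A brute-force check of that intuition for the simplest case, f(x) = c·x (toy code, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(42)

# A random two-layer linear net x -> hidden(n) -> out computes
# (sum of u_i * v_i over the kept paths) * x. Pruning picks a subset
# of paths, so approximating f(x) = c*x reduces to a subset-sum over
# the random products u_i * v_i.
c = 0.7312                        # arbitrary target coefficient
n = 16                            # hidden width of the random net
u = rng.uniform(-1, 1, n)         # random output-layer weights
v = rng.uniform(-1, 1, n)         # random input-layer weights
prods = u * v

# Brute-force all 2^16 masks (subnetworks) and keep the best one.
masks = (np.arange(2 ** n)[:, None] >> np.arange(n)) & 1
sums = masks @ prods
best = np.argmin(np.abs(sums - c))
err = abs(sums[best] - c)
print(f"target {c}, best subnetwork coefficient {sums[best]:.6f}")
```

With just 16 random paths there are 65536 subnetworks, and typically some subset sum lands within ~1e-4 of the target; the paper's width bound comes from making this argument uniform over all weights of the target network.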
As a potentially naive thought experiment, if you just generated in advance a number of random networks of similar size to the pruned lottery ticket, and then trained them all in parallel, would you eventually find a lottery ticket? If so, how many would you have to train to find a lottery ticket with high probability? Why is training one big network and then pruning any better than training lots of different smaller networks? Assume in all of the above that you have a rough idea of how big the pruned network will be.
The number of subgraphs increases exponentially with the number of additional layers (and neurons). If you started off with a network the size of the final pruned network, you would have a dramatically lower chance of finding a winning ticket compared to the oversized network you start with.
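A back-of-the-envelope version of that counting argument (treating overlapping subnetworks as independent, which they are not, so this is only a heuristic):

```python
# If a random subnetwork of the right size "wins" with tiny probability
# p, a network with W prunable weights contains about 2**W candidate
# subnetworks, so P(at least one winner) ~ 1 - (1 - p)**(2**W).
# Candidates share weights, so independence is a loose approximation.
p = 1e-6
for W in (10, 20, 30):
    p_winner = 1 - (1 - p) ** (2 ** W)
    print(f"{W:2d} weights -> P(some winning ticket) ~ {p_winner:.3f}")
```

Each extra weight doubles the pool of candidate tickets, which is why a modestly oversized network can contain a winner w.h.p. while a network that is already ticket-sized almost never is one.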
I've always felt that there is a deep connection between evolution and thought, or more specifically, genetic algorithms (GAs) and neural networks (NNs).
The state of the art when I started following AI in the late 90s was random weights and hyper-parameters chosen with a GA, then optimized with NN hill climbing to find the local maximum. Looks like the research has continued:
https://www.google.com/search?q=genetic+algorithm+neural+net...
All I'm saying is that since we're no longer compute-bound, I'd like to see more big-picture thinking. We're so obsessed with getting 99% accuracy on some pattern-matching test that we completely miss other options, like in this case that effective subnetworks can evolve within a larger system of networks.
I'd like to see a mathematical proof showing that these and all other approaches to AI, like simulated annealing, are (or can be made) equivalent. Sort of like a Church–Turing thesis for machine learning:
https://en.wikipedia.org/wiki/Church–Turing_thesis
If we had this, then we could use higher-level abstractions and substitute simpler algorithms (like GAs) for the harder ones (like NNs) and not get so lost in the minutiae and terminology. Once we had working solutions, we could analyze them and work backwards to convert them to their optimized/complex NN equivalents.
An analogy for this would be solving problems in our heads with simpler/abstract methods like spreadsheets, functional programming and higher-order functions. Then translating those solutions to whatever limited/verbose imperative languages we have to use for our jobs.
Edit: I should have said "NN gradient descent to find the local minimum" but hopefully my point still stands.
Edit 2: I should clarify that, in layman's terms, Church–Turing says "every effectively calculable function is a computable function", so functional programming and imperative programming can solve the same problems, be used interchangeably, and even be converted from one form to the other.
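For anyone who wants to play with the late-90s recipe described above (a GA over initial weights, gradient descent for local refinement), a toy sketch on a linear regression problem might look like this; the task and all hyper-parameters here are made up:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy GA + gradient descent hybrid: the GA searches over initial
# weights, plain gradient descent does the local refinement.
X = rng.normal(size=(100, 20))
w_true = rng.normal(size=20)
y = X @ w_true

def loss(w):
    return np.mean((X @ w - y) ** 2)

def descend(w, steps=100, lr=0.1):   # local "hill climbing" on -loss
    for _ in range(steps):
        w = w - lr * (X.T @ (X @ w - y)) / len(y)
    return w

pop = [rng.normal(size=20) for _ in range(16)]
for _ in range(5):
    pop = [descend(w) for w in pop]             # refine every candidate
    pop.sort(key=loss)
    parents = pop[:4]                           # selection
    children = [p + 0.1 * rng.normal(size=20)   # mutation
                for p in parents for _ in range(3)]
    pop = parents + children

print(f"best loss after GA + gradient descent: {loss(pop[0]):.2e}")
```

On a convex toy problem like this the GA adds nothing over plain descent; the hybrid only earns its keep on rugged loss surfaces, which is exactly the big-picture question being raised.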
> All I'm saying is that since we're no longer compute-bound, I'd like to see more big-picture thinking. We're so obsessed with getting 99% accuracy on some pattern-matching test that we completely miss other options.
There are so many people working in this field now, you can be sure a lot of them are doing big picture thinking.
> I'd like to see a mathematical proof showing that these and all other approaches to AI like simulated annealing are (or can be made) equivalent. Sort of like a Church–Turing thesis for machine learning:
Maybe I’m misunderstanding what you are saying, but I think the different optimization techniques/metaheuristics you’re talking about do actually have provably different properties.
What do you mean we're no longer compute bound? Why don't you try replicating Microsoft's Turing-NLG 17B parameter model: "... trains with only 256 NVIDIA GPUs compared to 1024 NVIDIA GPUs needed by using Megatron-LM alone."
It's mathematically interesting, but not a practical advance.