
Understanding the generalization of ‘lottery tickets’ in neural networks

98 points | hongzi | 6 years ago | ai.facebook.com | reply

16 comments

[+] gradys|6 years ago|reply
For anyone looking for a quick summary of what a lottery ticket is, how they're found, and some implications, here's what I remember from a talk I saw about lottery tickets at EMNLP 2019:

- The core hypothesis is that within "over-parameterized"[0] networks, a small subnetwork (a small set of weights) is often doing most of the work. A weight can be thought of as an edge in the neural network graph, and so subsets of weights can be thought of as subnetworks.

- You can find these subnetworks by initializing weights randomly, training for some number of steps, identifying the least-contributing weights, and then retraining from the same initial parameters as before, with those least-contributing weights zeroed out.

- People have observed that you can achieve something like 99% weight pruning with relatively little loss in performance. After that, things get very unstable.

- This has implications for understanding how neural nets do what they do and for shrinking model sizes.

This is all from memory, so forgive any errors.

[0] - Networks have grown to enormous numbers of parameters lately, and there's reason to think that even before the era of 1B+ parameter networks, neural nets were over-parameterized. Why do we use such large networks then? For a given trained neural net, there might be a much smaller one that does the same thing, but in practice, it's difficult to get the same performance by training a smaller network. This may be because our typical optimization methods aren't well suited for finding these lottery tickets.
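The train / prune / rewind-and-retrain loop described above can be sketched in toy form. Everything below is illustrative, not from the article: the model is plain linear regression, "least contributing" is approximated by smallest magnitude, and the 80% pruning fraction is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression data in which only a few input features matter,
# so most weights can be pruned without hurting the fit.
n, d = 200, 50
X = rng.normal(size=(n, d))
true_w = np.zeros(d)
true_w[:5] = rng.normal(size=5)          # only 5 "useful" features
y = X @ true_w

def train(w, mask, steps=500, lr=0.05):
    """Gradient descent on MSE; pruned weights are held at zero."""
    for _ in range(steps):
        grad = 2 * X.T @ (X @ w - y) / n
        w = (w - lr * grad) * mask       # masked weights stay zeroed
    return w

w0 = rng.normal(size=d) * 0.1            # the shared random initialization
mask = np.ones(d)

# One pruning round: train, drop the 80% smallest-magnitude weights,
# rewind the survivors to their values in w0, then retrain.
w_trained = train(w0 * mask, mask)
threshold = np.quantile(np.abs(w_trained), 0.8)
mask = (np.abs(w_trained) > threshold).astype(float)
w_ticket = train(w0 * mask, mask)        # sparse net, same init as before

mse = np.mean((X @ w_ticket - y) ** 2)
print(f"kept {int(mask.sum())}/{d} weights, MSE {mse:.4f}")
```

In this toy setting the retrained sparse model recovers the fit because the surviving weights cover the truly informative features; real lottery-ticket experiments iterate this prune-and-rewind cycle over several rounds on deep networks.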

[+] longtom|6 years ago|reply
Maybe the additional parameters give the entire network more "leeway" to find such subnetwork structures, i.e. ease gradient descent by smoothing the loss landscape?
[+] dmurray|6 years ago|reply
> Why do we use such large networks then? For a given trained neural net, there might be a much smaller one that does the same thing, but in practice, it's difficult to get the same performance by training a smaller network.

Also, because it's cheap to run a network with a billion parameters. Lots of real-world applications are constrained not by compute, but by the size of the training set.

[+] bigred100|6 years ago|reply
Are the results different from just doing something like L1 regularization?
[+] 55555|6 years ago|reply
What does parameter mean in this context? Clearly we aren't looking at 1B+ 'features' in order to see which are predictive, right? That seems infeasible.
[+] cbuq|6 years ago|reply
How is this different from a sparse autoencoder?
[+] edoo|6 years ago|reply
It seems like that should be expected. The point is to have a ton of data with possible patterns to find and you have no idea which data links to what other data. You'd expect most data to not correlate. You'd expect strong patterns to emerge between a subset of the parameters. You'd expect to be able to prune off the useless parameters after the fact.
[+] sandoooo|6 years ago|reply
The next question would be: is it possible to compare across multiple lottery ticket networks, and look for commonality? Maybe the lottery ticket params all give rise to some small set of particularly efficient configurations that we can add directly to other networks?
[+] closetCS|6 years ago|reply
My main question about these smaller, "lottery ticket" networks is that they are trained over and over on the same problem as the bigger network, and evaluated on the same dataset, which leads me to believe the model will fail to generalize to different but related problems. For example, if the model was trained on ImageNet and a winning ticket was found with ridiculous accuracy for a relatively small network, I would expect it to be heavily overfitted.
[+] bonoboTP|6 years ago|reply
The article discusses this as its main theme and shows generalization across datasets.