Could you explain why you think that? I'm looking at the lottery ticket section and it seems like he doesn't disown it; the reason he gives, via Abhinav, for not pursuing it at his commercial job is just that that kind of sparsity is not hardware friendly (except with Cerebras). "It doesn't provide a speedup for normal commercial workloads on normal commercial GPUs and that's why I'm not following it up at my commercial job and don't want to talk about it" seems pretty far from "disowning the lottery ticket hypothesis [as wrong or false]".
cool beans, thanks for this -- I think it's easier to hear it directly from the authors. I was hesitant to start researchposting and come off like a dick.
Also, note to self: if I publish and disown my papers, Shawn will interview me :)
What evidence against it do you have in mind? I think it's a result of little practical relevance without a way to identify winning tickets that doesn't require buying lots of tickets until you hit the jackpot (i.e. training a large, dense model to completion), but that doesn't make the observation itself incorrect.
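For concreteness, here is a minimal sketch of why finding a ticket costs a full dense training run. It assumes the usual train, prune by magnitude, then rewind to initialization recipe; `find_winning_ticket` and `train_fn` are hypothetical names for illustration, and weights are a flat Python list rather than a real network.

```python
def find_winning_ticket(init_weights, train_fn, sparsity):
    """Toy lottery-ticket search over a flat list of weights.

    train_fn is a hypothetical stand-in for "train the dense model to
    completion"; it takes a list of weights and returns the trained list.
    """
    # The expensive step: the mask is only known after full dense training.
    trained = train_fn(list(init_weights))

    # Prune the smallest-magnitude trained weights.
    n_prune = int(len(trained) * sparsity)
    order = sorted(range(len(trained)), key=lambda i: abs(trained[i]))
    pruned = set(order[:n_prune])
    mask = [0 if i in pruned else 1 for i in range(len(trained))]

    # Rewind the surviving weights to their ORIGINAL initial values.
    ticket = [w * m for w, m in zip(init_weights, mask)]
    return mask, ticket
```

The point of the sketch is only that `train_fn` has to run before the mask exists, which is the "buying lots of tickets" cost described above.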
The observation itself is also partially incorrect. This is a video I watched a few months ago that goes further into the whole question of how you deal with subnetworks.
At the timestamp, they discuss how the original ICLR results only worked on extremely tiny models and didn't hold for larger ones. The adaptation you need to sort of fix it is to train densely for a few epochs first; only then can you start increasing sparsity.
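A rough sketch of what such a schedule could look like, assuming a linear ramp after the dense phase. The function name, the default epoch counts, and the 0.9 target are illustrative assumptions, not values from the video.

```python
def sparsity_at(epoch, warmup_epochs=3, ramp_epochs=10, final_sparsity=0.9):
    """Target fraction of weights to prune at a given (0-indexed) epoch.

    Assumed shape: stay fully dense for warmup_epochs, then increase
    sparsity linearly over ramp_epochs up to final_sparsity.
    """
    if epoch < warmup_epochs:
        return 0.0  # dense phase: no pruning yet
    # Fraction of the ramp completed, capped at 1.0 once the ramp is over.
    progress = min(1.0, (epoch - warmup_epochs + 1) / ramp_epochs)
    return final_sparsity * progress


if __name__ == "__main__":
    for epoch in range(15):
        print(epoch, round(sparsity_at(epoch), 3))
```

In a real training loop, the returned value would feed whatever pruning step is used each epoch. The linear ramp is just the simplest choice for illustration.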
swyx|1 month ago
gwern|1 month ago
laughingcurve|1 month ago
yorwba|1 month ago
kingstnap|1 month ago
https://youtu.be/WW1ksk-O5c0?list=PLCq6a7gpFdPgldPSBWqd2THZh... (timestamped)