singulargalaxy | 11 months ago

Hard disagree. Your link relies on gradient descent as an explanation, whereas OP explains why optimization is not needed to understand DL generalization. PAC-Bayes and the other countable-hypothesis bounds in OP are also quite different from VC dimension. The whole point of OP seems to be that these other frameworks, unlike VC dimension, can explain generalization even with an arbitrarily flexible hypothesis space.

cgdl | 11 months ago

Yes, and that's the problem. What Zhang et al [2] showed convincingly in the Rethinking paper is that focusing on the hypothesis space alone cannot be enough: the same hypothesis space fits both real and random data, so it is already too large. These methods therefore have to appeal to a bias in practice towards a better subspace, and that already requires studying the specific optimization algorithm to understand why it picks certain hypotheses over others in the space.
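A toy illustration of that point (my own sketch, not from either paper): plain 1-nearest-neighbor memorization, standing in for an overparameterized network, reaches zero training error whether the labels are real or pure noise, so the flexibility of the hypothesis class alone cannot distinguish the two cases:

```python
import random

# Toy version of the Zhang et al. observation: a flexible enough learner
# fits real labels and random labels equally well on the training set.
# All names and the 2D task here are illustrative, not from the papers.

random.seed(0)

xs = [(random.random(), random.random()) for _ in range(100)]
real = [1 if x[0] > 0.5 else 0 for x in xs]   # learnable labels
rand = [random.randint(0, 1) for _ in xs]      # pure noise labels

def nn_predict(train, x):
    # 1-NN: return the label of the closest training point (memorization)
    return min(train, key=lambda p: (p[0][0] - x[0]) ** 2
                                    + (p[0][1] - x[1]) ** 2)[1]

errs = {}
for name, ys in [("real", real), ("random", rand)]:
    train = list(zip(xs, ys))
    errs[name] = sum(nn_predict(train, x) != y for x, y in train) / len(train)
    print(name, "training error:", errs[name])  # 0.0 in both cases
```

Uniform bounds over this hypothesis class cannot separate the two runs; something beyond the size of the class has to do the work.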

But once you are ready to do that, algorithmic stability is enough. You don't then need Bayesian ensembles or other proxies/simplifications; you can focus on just the specific learning setup you have. BTW, algorithmic stability is not a new idea. An early version showed up within a few years of VC theory, in the late 1970s, to understand why nearest neighbors generalizes (it wasn't called algorithmic stability then, though).

If you are interested in this, I also recommend [3].

[2] https://arxiv.org/abs/1611.03530

[3] https://arxiv.org/abs/1902.04742

singulargalaxy | 11 months ago

But it's not a problem; it's actually a strength that OP's explanation is more general. One of the main points of the OP paper is that you do not, in fact, need proxies or simplifications: you can derive generalization bounds that explain this behavior without relying on optimization dynamics. This responds exactly to the tests set forth in Zhang et al. OP does not "rely on Bayesian ensembles, or other proxies/simplifications"; that seems to be a misunderstanding of the paper. It analyzes the solutions that neural networks actually reach, which differentiates it from a lot of other work. It also shows how other simple model classes can reproduce the same behavior, and those reproductions do not depend on optimization.

"and that already requires studying the specific optimization algorithm to understand why it picks certain hypotheses over others in the space." But the OP paper explains how even "guess and check" can generalize similarly to SGD. It is becoming better understood that the role of the optimizer may have been historically overstated for understanding DL generalization; it seems to be more about the loss landscape.
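A minimal sketch of the "guess and check" idea (my own toy setup, not the paper's experiment): sample random linear separators until one fits a small training set, then measure that first fit's held-out accuracy. On an easy halfspace task the randomly guessed hypothesis already generalizes well, with no optimizer involved:

```python
import random

# Hypothetical "guess and check" learner: draw random linear separators
# and keep the first one consistent with the training set. The task and
# all parameters here are illustrative, not from the OP paper.

random.seed(0)

def make_point():
    x = (random.uniform(-1, 1), random.uniform(-1, 1))
    return x, (1 if x[0] + x[1] > 0 else 0)   # true concept: a halfspace

train = [make_point() for _ in range(30)]
test = [make_point() for _ in range(200)]

def predict(w, x):
    return 1 if w[0] * x[0] + w[1] * x[1] > 0 else 0

def guess_and_check(data, tries=100000):
    for _ in range(tries):
        w = (random.gauss(0, 1), random.gauss(0, 1))
        if all(predict(w, x) == y for x, y in data):
            return w                           # first hypothesis that fits
    return None                                # no consistent guess found

w = guess_and_check(train)
acc = sum(predict(w, x) == y for x, y in test) / len(test)
print(f"held-out accuracy of first random fit: {acc:.2f}")
```

The point is that the "search procedure" here has no dynamics at all; whatever generalization we see must come from the interaction of the data with the set of hypotheses that fit it, not from how a particular optimizer walks through that set.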

Don't get me wrong, the references you're linking are super interesting. But they don't take away from the OP paper, which adds something quite valuable to the discussion.