skeptic_69 | 7 years ago
The typical way of showing generalization in ML is to show that if we find some low- or zero-error solution on the training dataset, then for a large enough dataset, with high probability, the error on our training dataset is close to the error on the real and unknown distribution. The first step, which is basically "find a low-error hypothesis on the training data," is called the ERM principle.
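For concreteness, here are those two steps in the standard uniform-convergence notation (my own choice of symbols, a sketch of the usual textbook formulation):

```latex
% Step 1 (ERM): minimize the empirical (training) risk over the class
\hat{h} \in \operatorname*{arg\,min}_{h \in \mathcal{H}} L_S(h),
\qquad
L_S(h) = \frac{1}{n} \sum_{i=1}^{n} \ell\bigl(h(x_i), y_i\bigr)

% Step 2 (generalization): for a large enough sample,
% n \ge m_{\mathcal{H}}(\epsilon, \delta), with probability at least
% 1 - \delta over the draw of the training set S,
\bigl| L_{\mathcal{D}}(\hat{h}) - L_S(\hat{h}) \bigr| \le \epsilon
```

So low training risk plus the uniform-convergence guarantee together give low risk on the true distribution D.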
In practice we observe that stochastic gradient descent works pretty well at solving the ERM problem, and the solutions generalize well (perform well when deployed).
This is very weird, since neural networks are really strange objects with very non-linear and non-convex loss landscapes, and gradient descent shouldn't cope well with all those bumps, curves, and valleys.
People want to show mathematically that stochastic gradient descent does well on neural networks.
This paper claims gradient descent is effective at minimizing the quadratic loss on the training data.
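A quick sketch of the phenomenon the paper studies (this is an illustrative toy, not the paper's exact construction): full-batch gradient descent on the quadratic training loss of a heavily over-parameterized two-layer ReLU network. Only the first-layer weights are trained; the output weights are fixed random signs. All sizes and the step size are made-up illustrative values.

```python
# Toy demo: gradient descent drives the quadratic training loss of an
# over-parameterized two-layer ReLU net toward zero on random data.
import numpy as np

rng = np.random.default_rng(0)

n, d, m = 10, 5, 1000                            # samples, input dim, hidden width (m >> n)
X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)    # unit-norm inputs
y = rng.standard_normal(n)                       # arbitrary real-valued labels

W = rng.standard_normal((m, d))                  # trained first-layer weights
a = rng.choice([-1.0, 1.0], size=m)              # fixed output layer

def predict(W):
    return np.maximum(X @ W.T, 0.0) @ a / np.sqrt(m)

def loss(W):
    r = predict(W) - y
    return 0.5 * float(r @ r)                    # quadratic (squared) training loss

lr = 0.3                                         # illustrative step size
initial = loss(W)
for _ in range(5000):
    pre = X @ W.T                                # pre-activations, shape (n, m)
    act = (pre > 0.0).astype(float)              # ReLU derivative
    res = predict(W) - y                         # residuals, shape (n,)
    # dL/dw_r = (a_r / sqrt(m)) * sum_i res_i * 1[w_r . x_i > 0] * x_i
    grad = (a[:, None] / np.sqrt(m)) * ((act * res[:, None]).T @ X)
    W = W - lr * grad
final = loss(W)
print(f"training loss: {initial:.3f} -> {final:.3e}")
```

Despite the non-convexity, the training loss should drop to near zero — which is exactly the "achieves zero training loss" part. Note this says nothing by itself about the loss on the true distribution.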
If we could improve the results to show that we also get low loss on the true distribution, that might be compelling evidence that gradient descent converges to a minimum-error solution.
None of this is explicitly stated, since it's a well-understood part of the basic literature in learning theory.
Showing an algorithm can do ERM on the hypothesis class is the first (and easier) part of showing generalization.
If you want a good reference that explains this in a more coherent way, I recommend looking at the first 4 chapters of Understanding Machine Learning: From Theory to Algorithms by Shai Shalev-Shwartz and Shai Ben-David.
If you still think the comments I was responding to are not totally incoherent, take note of the fact that the very first sentence of the paper is: "One of the mysteries in deep learning is random initialized first order methods like gradient descent achieve zero training loss"