ramgorur | 7 years ago
1. It's theoretically impossible to guarantee convergence to a global optimum using gradient descent if the function is non-convex.
2. The only way to guarantee it is to start gradient descent from different points in the search space, or to try different step sizes if the algorithm always starts from the same point.
3. Also, does "achieving zero training loss" mean the network has converged to a global optimum? My understanding was that you can get zero training loss even at a local minimum.
Please correct me if I am wrong.
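Regarding point 2, a minimal sketch of the multi-start idea (the quartic objective, learning rate, and restart range here are arbitrary illustrative choices, not anything from the thread):

```python
import random

# Non-convex toy objective: f(x) = x^4 - 3x^2 + x has two local minima,
# near x ~ -1.301 (the global one) and x ~ 1.131 (a non-global one).
def grad(x):
    return 4 * x**3 - 6 * x + 1  # derivative of x^4 - 3x^2 + x

def gradient_descent(x0, lr=0.01, steps=2000):
    x = x0
    for _ in range(steps):
        x -= lr * grad(x)
    return x

# Restarting from several random points reaches both minima;
# a single fixed start would only ever find one of them.
random.seed(0)
minima = {round(gradient_descent(random.uniform(-2, 2)), 3) for _ in range(20)}
print(sorted(minima))
```

Multi-start still only improves the odds of hitting the global basin; it is not a guarantee either.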
LolWolf | 7 years ago
1. This is false. See, e.g., [0][1].
2. I'm not really sure what the question is here.
3. If your loss is bounded from below (it is a square norm) by 0 and you achieve 0 loss, this means that 0 is a global optimum, since, by definition, no other objective value can be smaller than this number.
---
[0] Theorem A.2 in Udell et al.'s Generalized Low Rank Models paper: https://arxiv.org/pdf/1410.0342.pdf
[1] Boyd & Vandenberghe, Convex Optimization (https://web.stanford.edu/~boyd/cvxbook/), Appendix B.1. In fact, I can't find the reference right now, but you can easily prove that GD with an appropriate step size converges to a global optimum on this problem when initialized at (0,0), even though the problem is non-convex.
DoctorOetker | 7 years ago
I think the OP intended and should have written:
"It's theoretically impossible to guarantee a convergence to global optima using gradient descent for an arbitrary non-convex function."
For example, consider the function f(x) = sin^2(pi*x) + sin^2(pi*N/x). This function has multiple global minima at the divisors of N, where f(x) == 0; if x or N/x is non-integer, f(x) is guaranteed to be positive...
I am not taking a stance on whether gradient descent does or does not guarantee finding global minima (and is thus able to factorize cryptographic-grade RSA products of primes), but the claim does appear to imply it.
Edit: the multiplication symbols turned some of the text italic
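A quick numerical check of this function (N = 15 is an arbitrary illustrative composite):

```python
import math

N = 15  # arbitrary composite for illustration; its divisors are 1, 3, 5, 15

def f(x):
    # f(x) = sin^2(pi*x) + sin^2(pi*N/x): zero (up to float error)
    # exactly when both x and N/x are integers, i.e. when x divides N.
    return math.sin(math.pi * x) ** 2 + math.sin(math.pi * N / x) ** 2

for x in (2, 3, 4, 5, 6):
    print(x, f(x))  # (near-)zero only at the divisors 3 and 5
```

The catch, of course, is that this landscape is littered with non-global local minima between the divisors, which is exactly where plain gradient descent can get stuck.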
TTPrograms | 7 years ago
2) This is "a way" not "The only way".
(If A then B) does not imply (if not A then not B)
ramgorur | 7 years ago
I don't understand. How do you prove that gradient descent is guaranteed to escape local minima?
RandyRanderson | 7 years ago
Re 2: No.
Re 3: Yes.
srean | 7 years ago
Anyway, this universal-approximation property of neural networks gets a lot of airtime and people go gaga over it. It's a complete red herring. It's not the first example of universal approximation and it won't be the last. There is no scarcity of universal approximators; there was no such scarcity even hundreds of years ago. The explanation of the success of DNNs lies elsewhere.
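For instance, polynomials are a centuries-old universal approximator (Weierstrass, 1885). A quick sketch, with the target function and polynomial degree chosen arbitrarily for illustration:

```python
import numpy as np

# Universal approximation is not unique to neural networks: by the
# Weierstrass approximation theorem, polynomials are dense in the
# continuous functions on any closed interval.
xs = np.linspace(0.0, np.pi, 200)
ys = np.sin(xs)  # target function to approximate

coeffs = np.polyfit(xs, ys, deg=9)  # least-squares polynomial fit
max_err = np.max(np.abs(np.polyval(coeffs, xs) - ys))
print(f"max error of degree-9 fit: {max_err:.1e}")
```

Fourier series, radial basis functions, splines, and boosted stumps are all universal approximators too; the property alone explains little about why DNNs in particular work well.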