item 18436117

ramgorur | 7 years ago

I did not understand the paper very well.

1. It's theoretically impossible to guarantee convergence to a global optimum using gradient descent if the function is non-convex.

2. The only way to guarantee it is to restart gradient descent from different points in the search space, or, if the algorithm always starts from the same point, to try different step sizes.

3. Also, does "achieving zero training loss" mean the network has converged to a global optimum? I was under the impression that you can get zero training loss even at a local minimum.

Please correct me if I am wrong.


LolWolf | 7 years ago

1. > It's theoretically impossible to guarantee convergence to a global optimum using gradient descent if the function is non-convex.

This is false. See, e.g., [0][1].

2. I'm not really sure what the question is here.

3. If your loss is bounded from below by 0 (it is a squared norm) and you achieve 0 loss, then 0 is a global optimum, since, by definition, no other objective value can be smaller than this number.

---

[0] Theorem A.2 in Udell's Generalized Low-Rank models paper https://arxiv.org/pdf/1410.0342.pdf

[1] B&V Convex Optimization (https://web.stanford.edu/~boyd/cvxbook/), Appendix B.1. In fact, I can't find the reference right now, but you can easily prove that GD with an appropriate step-size converges to a global optimum on this problem when initialized at (0,0), even though the problem is non-convex.
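Point 3 can be sketched numerically. This is a toy example of my own (not the problem from [0] or [1]): the loss sin^2(w) is non-convex, but it is bounded below by 0, so any point where gradient descent drives it to 0 is a global optimum by definition.

```python
import math

# Non-convex but non-negative loss: any point with loss 0 is a global optimum.
def loss(w):
    return math.sin(w) ** 2

def grad(w):
    return math.sin(2 * w)  # d/dw sin^2(w) = 2 sin(w) cos(w) = sin(2w)

w, lr = 0.5, 0.2
for _ in range(200):
    w -= lr * grad(w)

print(loss(w))  # ~0: global, since no value of this loss can be below 0
```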

DoctorOetker | 7 years ago

That reference you can't find right now seems rather pertinent?

I think the OP intended and should have written:

"It's theoretically impossible to guarantee a convergence to global optima using gradient descent for an arbitrary non-convex function."

For example, consider the function f(x) = sin^2(pi*x) + sin^2(pi*N/x). This function has multiple global minima at the divisors of N, where f(x) == 0; if x or N/x is non-integer, f is guaranteed to be positive...

I am not taking a stance on whether gradient descent does or does not guarantee finding global minima (and is thus able to factorize cryptographic-grade RSA products of primes), but the claim does appear to imply it.

Edit: the multiplication symbols turned some of the text italic
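The claim about that function can be checked numerically; a quick sketch, with N = 12 chosen arbitrarily for illustration:

```python
import math

N = 12  # arbitrary choice for illustration

def f(x):
    # Non-negative; f(x) == 0 exactly when both x and N/x are integers.
    return math.sin(math.pi * x) ** 2 + math.sin(math.pi * N / x) ** 2

# Up to floating-point noise, the zeros among the integers 1..N
# are exactly the divisors of N.
zeros = [d for d in range(1, N + 1) if f(d) < 1e-18]
print(zeros)  # [1, 2, 3, 4, 6, 12] -- the divisors of 12
```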

nshm | 7 years ago

The title of the paper is really misleading, and the comments here are even more misleading. The key is their theorem, where they say "with high probability over random initialization". They initialize many times and sometimes it converges; a single initialization can get stuck in a local minimum, of course.
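The random-restart reading can be illustrated on a toy asymmetric double well (my own example, not the paper's setting): some initializations converge to the local minimum, others to the global one, and restarting keeps the best.

```python
import random

# Toy asymmetric double well: local minimum near x = +1 (f ~ +0.29),
# global minimum near x = -1 (f ~ -0.31).
def f(x):
    return (x * x - 1) ** 2 + 0.3 * x

def grad(x):
    return 4 * x * (x * x - 1) + 0.3

def gd(x0, lr=0.01, steps=2000):
    x = x0
    for _ in range(steps):
        x -= lr * grad(x)
    return x

random.seed(0)
runs = [gd(random.uniform(-2, 2)) for _ in range(20)]
best = min(runs, key=f)
# A single run may get stuck near x = +1; across random restarts, with
# high probability at least one initialization lands in the global basin.
print(round(best, 2), round(f(best), 3))
```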

DoctorOetker | 7 years ago

But then there is very little of interest: (assuming enough smoothness and Lipschitz continuity) one expects every global minimum to have a convex neighbourhood such that gradient descent started within that neighbourhood reaches the global minimum. "Initialize many times and sometimes it converges" is just saying the obvious: "there exist initial positions for which GD succeeds in finding a global minimum"... Is my interpretation correct?

TTPrograms | 7 years ago

1) Just because you can prove convergence for convex functions does not mean you can't prove it for non-convex functions.

2) This is "a way", not "the only way".

(If A then B) does not imply (if not A then not B)

ramgorur | 7 years ago

>Just because you can prove convergence for convex functions does not mean you can't prove it for non-convex functions.

I don't understand. How do you prove that gradient descent is guaranteed to escape local minima?

RandyRanderson | 7 years ago

FF NNs with even one hidden layer are universal approximators. That is, they do find the global min. What this doesn't tell you is that it's likely a huge graph and will take a looong time to optimize, even for trivial data sets. There are lots of proofs around. That's why SGD is used, and only on a small subset of training points at a time.

Re 2: No.

Re 3: Yes.

simonhughes22 | 7 years ago

They are universal approximators, although the term "approximator" means there's an upper bound on how well they can emulate a given function (based on the size of the network). However, just because they are universal approximators doesn't mean that you can automatically infer the optimal number of connections, and the weight for each of those connections, in order to minimize the loss over some dataset (derived from some function). Being a universal approximator does not imply being able to automatically learn the best approximation of a given function. It's the difference between being capable of learning something and having learned it. If that makes sense.
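The existential flavor of this can be made concrete with a toy example of my own (nothing from the paper): hand-picked weights under which a one-hidden-layer ReLU net represents |x| exactly. That such weights exist is what universal approximation promises; whether training would actually find them is a separate question.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

# Hand-picked weights: one hidden layer with two ReLU units represents
# f(x) = |x| exactly, since |x| = relu(x) + relu(-x).
W1 = np.array([[1.0], [-1.0]])   # hidden layer: 2 units, 1 input
W2 = np.array([[1.0, 1.0]])      # output layer: sums the two units

def net(x):
    return (W2 @ relu(W1 @ x.reshape(1, -1))).ravel()

x = np.linspace(-3.0, 3.0, 101)
print(np.max(np.abs(net(x) - np.abs(x))))  # exactly 0: the representation is exact
```

No gradient step was taken here; the parameters were simply written down, which is all the universal approximation theorem guarantees can be done.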

srean | 7 years ago

I don't think you understand what universal approximation means. It means there exist parameter settings that reduce the approximation error as much as you want. It's an existential property; it does not mean that those parameters can be found.

Anyway, this universal property of neural networks gets a lot of airtime, and people go gaga over it. It's a complete red herring. It's not the first example of universal approximation, and it won't be the last. There is no scarcity of universal approximators; there was no such scarcity even hundreds of years ago. The explanation for the success of DNNs lies elsewhere.