top | item 24840647


acadien | 5 years ago

Hey @moultano, in response to your argument about walls and nets not being in a minimum: it's my understanding that nets always live on high-dimensional saddle points, and that's commonly referred to in the literature. Even when you're optimizing, you're just moving toward ever-lower-cost saddles that are closer to the optimum, but almost never a local optimum (for the reasons spelled out in your post).


moultano | 5 years ago

Thank you. Several people have pointed that out; I'm probably not reading the right papers. Is it common, when people introduce a new flavor of adaptive SGD, to address how it handles saddles specifically? It's probably just a matter of what manages to bubble up to me rather than what work is actually getting done, but I felt like the non-convergence of Adam got talked about a lot, while I haven't seen people talking as much about how optimizers behave differently on the landscapes we actually observe.

acadien | 5 years ago

Saddles are a way of conceptualizing high-dimensional optimization problems. On a 3-dimensional surface, you can picture a saddle point as sitting on an isocurve that follows a minimum in at least one dimension.

Another way to conceptualize this is to think of being at the minimum of a parabola in 2 dimensions, but then seeing that you're not at a minimum in a 3rd dimension. Any time you're at a minimum in at least one dimension, but not in all of them, you're on a saddle.
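A minimal numeric sketch of that picture (my own toy example, not from the thread): f(x, y) = x² − y² is a minimum along the x-axis and a maximum along the y-axis, which is exactly a saddle.

```python
import numpy as np

# A classic saddle: f(x, y) = x^2 - y^2.
# Along x the origin is a minimum; along y it is a maximum.
def grad(x, y):
    return np.array([2.0 * x, -2.0 * y])

# The gradient vanishes at the origin, so it is a critical point...
g = grad(0.0, 0.0)

# ...but the Hessian [[2, 0], [0, -2]] has one positive and one
# negative eigenvalue, which is the definition of a saddle point.
hessian = np.array([[2.0, 0.0], [0.0, -2.0]])
eigvals = np.linalg.eigvalsh(hessian)  # sorted ascending: [-2., 2.]
print(g, eigvals)
```

The mixed-sign eigenvalues are the test: all positive would mean a local minimum, all negative a local maximum.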

You can extend this concept to a neural net, which lives in millions of dimensions, undergoing SGD. When beginning an optimization run, SGD moves in some direction to minimize a bundled cost, inevitably stumbling into minima in (usually) many dimensions. Subsequent iterations will shift some dimensions out of minima and other dimensions into minima; the net is always living on a saddle during this process.
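A toy illustration of why saddles matter for gradient descent (again my own sketch, using the same x² − y² surface as a stand-in for one "minimized" dimension and one "unminimized" one): descent started exactly on the minimized axis converges to the saddle and stays there, while any tiny component along the escape direction grows each step.

```python
import numpy as np

# Gradient of the toy saddle f(x, y) = x^2 - y^2.
def grad(p):
    return np.array([2.0 * p[0], -2.0 * p[1]])

def descend(p0, lr=0.1, steps=100):
    """Plain gradient descent from p0."""
    p = np.array(p0, dtype=float)
    for _ in range(steps):
        p = p - lr * grad(p)
    return p

# Started exactly on the x-axis, descent converges to the saddle itself:
stuck = descend([1.0, 0.0])

# A tiny perturbation along y is amplified every step (y -> 1.2 * y here)
# and eventually escapes the saddle:
escaped = descend([1.0, 1e-6])
print(stuck, escaped)
```

In millions of dimensions the gradient noise of SGD supplies that perturbation, which is one intuition for why nets drift off saddles rather than sitting at them.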

There are many papers that discuss the process in these terms, and others that implicitly use it. I wouldn't say it's a "hot area of research"; it's more a tool for thinking about these processes and sometimes gaining some insight into why things get stuck during training.

muppet_frog | 5 years ago

This paper makes the point that it's saddles, and not local minima, that are the problem: https://arxiv.org/abs/1406.2572 It was the basis for adding 'momentum' to optimizers, so that you could skate across the saddles.
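A sketch of that intuition (my own toy comparison, not from the paper): on the saddle f(x, y) = x² − y², momentum accumulates the small but consistent pull along the escape direction, so it leaves the saddle region far faster than plain gradient descent.

```python
import numpy as np

# Gradient of the toy saddle f(x, y) = x^2 - y^2.
def grad(p):
    return np.array([2.0 * p[0], -2.0 * p[1]])

def sgd(p0, lr=0.01, steps=200, momentum=0.0):
    """Gradient descent with classical (heavy-ball) momentum."""
    p = np.array(p0, dtype=float)
    v = np.zeros_like(p)
    for _ in range(steps):
        v = momentum * v - lr * grad(p)
        p = p + v
    return p

start = [1.0, 1e-4]          # almost on the saddle's stable axis
plain = sgd(start, momentum=0.0)
heavy = sgd(start, momentum=0.9)

# Momentum compounds the consistent push along y, so |y| grows
# much faster with momentum than without it.
print(abs(plain[1]), abs(heavy[1]))
```

The same mechanism is why momentum also helps on long, nearly flat valleys: a weak but persistent gradient direction gets integrated over many steps instead of being taken one small step at a time.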