> Brouwer’s fixed point theorem. The theorem says that for any continuous function, there is guaranteed to be one point that the function leaves unchanged — a fixed point, as it’s known. This is true in daily life. If you stir a glass of water, the theorem guarantees that there absolutely must be one particle of water that will end up in the same place it started from.
Wait. What if by stirring I shook the glass and at the end set it back down a foot away from its initial location? Is this author claiming that Brouwer’s theorem guarantees a particle floating somewhere where the glass used to be? Unchanged?
I can imagine an unlimited number of scenarios where I can guarantee the water particles are displaced.
This seems like a bad application of the theorem, right?
A better example I've seen for the theorem: if you take a paper map of a country, messily scrunch it up into a ball, and drop it somewhere in the country, there will be at least one point on the map lying exactly above the real point it represents.
As others have said, it's meant for mappings of the space to itself. So stirring the water, but not moving the glass.
But anyway, the theorem works only for continuous mappings. The moment they started mentioning "water particles", instead of some hypothetical fluid that forms a continuous block, the theorem no longer applied. You could break it by mirroring the position of every particle. There's still a fixed set (the line of reflection), but there's no obligation that a particle sits on that line.
It only applies when you're mapping a space onto itself (plus a couple other qualifications), so you have to put the glass back in the same place.
I think of it as a generalization of the intermediate value theorem - some things are going left, some things are going right, one thing must be sitting still in between.
They left off some qualifiers. According to wikipedia, it's applicable to a continuous function mapping a nonempty compact convex set to itself. Which ends up making a lot more sense to me.
The theorem assumes the function has identical domain and codomain. E.g. f(x) = x^3 maps [0, 1] to itself.
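In one dimension this follows from the intermediate value theorem, and you can even compute a fixed point by bisection. A minimal sketch (the `fixed_point` helper and its tolerance are my own illustrative choices, not anything from the article):

```python
import math

def fixed_point(f, lo=0.0, hi=1.0, tol=1e-10):
    # Brouwer in 1-D: a continuous f: [lo, hi] -> [lo, hi] gives
    # g(x) = f(x) - x with g(lo) >= 0 and g(hi) <= 0, so the
    # intermediate value theorem guarantees a root of g,
    # i.e. a fixed point of f. Bisect on the sign of g.
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if f(mid) - mid >= 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

x_cube = fixed_point(lambda t: t ** 3)  # x^3 = x on [0, 1]
x_cos = fixed_point(math.cos)           # cos(x) = x, the "Dottie number"
```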
So that's why they gave the glass stirring example, the domain here is the whole volume of water. So as long as it ends up in the same volume of water, the process of stirring is assumed to be continuous and therefore this theorem applies. Your example changes where it ends up (ie not back to the same domain) and so it cannot be applied.
I'm misunderstanding the theorem. Take f(x) = x + 1. That is a continuous function, but there does not exist an x where f(x) = x holds. What am I missing?
> If you stir a glass of water, the theorem guarantees that there absolutely must be one particle of water that will end up in the same place it started from.
Wait what? If you stir a glass long enough, any configuration of particles should be possible. For instance you can imagine "cutting" the water like a deck of cards.
The full theorem is about a continuous mapping of a space back onto itself. So, only the water in the glass mapped back to the same glass. Pour it out into another glass, and you're no longer mapping back to the original glass. Pour it back into the original glass, and a fixed point is possible again. And, obviously, only with respect to the mapping within the glass.
For problems that require recognition of patterns, whether in images or sounds, a non-linear function represented by a neural network and trained through gradient descent suffices to produce smoothly distributed fuzzy classification that just works.
This setup is not a good prior for anything that requires a more precise solution.
What would you search for, the zero/root of the derivative? Imagine the derivative is a function like y = x^2 - eps for some small eps, say 0.000001. For a binary search like that to work, you need to start with two values x1 and x2 such that sign(y1) != sign(y2). But with the derivative above, those values are hard to find with random initialization.
More generally, for binary search to find a value x such that f(x)=0, we need to have f(any value less than x)<0 and f(any value greater than x)>0. We can't guarantee these properties for general derivatives.
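To make the bracketing problem concrete, here is a quick numerical sketch (the interval [-1, 1], the seed, and the trial count are arbitrary choices of mine): with eps = 1e-6, the function g(x) = x^2 - eps is negative only on the tiny interval (-0.001, 0.001), so randomly drawn starting points almost never produce the sign change bisection requires.

```python
import random

def g(x):
    # A derivative with roots at +-sqrt(eps): it is negative only
    # on the narrow interval (-0.001, 0.001).
    return x * x - 1e-6

random.seed(0)
trials = 10_000
# A bisection bracket [x1, x2] needs sign(g(x1)) != sign(g(x2)).
# Since g > 0 almost everywhere on [-1, 1], that requires at least
# one of the two random endpoints to land inside (-0.001, 0.001).
brackets = sum(
    1
    for _ in range(trials)
    if g(random.uniform(-1, 1)) < 0 or g(random.uniform(-1, 1)) < 0
)
# brackets is typically on the order of tens out of 10,000 trials
```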
This site (Quanta Magazine) had really excellent content when it first launched, like the article in the OP; now it has become a sort of IFLScience in terms of editorializing.
Start from the Taylor expansion f(x) ≈ f(x_n) + Df(x_n)(x - x_n) + O(|x - x_n|^2), drop the quadratic term, equate to 0, and you get an approximate solution x for the next iterate x_{n+1}:
x_{n+1} = x_n - Df(x_n)^{-1} * f(x_n)
As written, this is Newton's method. The problem is that the Jacobian Df(x_n) is a k x k matrix, and inverting it may require O(k^3) work.
So, pretty much all schemes are based on approximations of the Jacobian.
Gradient descent for example replaces Df(x_n)^{-1} with a scalar a, so it's O(1) work to "compute" it:
x_{n+1} = x_n - a * f(x_n)
A method like L-BFGS tries to build a relatively cheap approximation to the inverse of the Jacobian during iteration, resulting in O(k) work to compute:
x_{n+1} = x_n - P * f(x_n)
Other methods may exploit sparsity of the Jacobian, or solve the linear system Df(x_n)z = f(x_n) only approximately for z using e.g. iterative methods.
Note: in optimization problems, the function f is typically the gradient of a cost function, and the Jacobian is then the Hessian of that cost function.
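The two extremes above can be compared on a toy problem (the quadratic cost, dimensions, and step size here are my own illustrative choices): a Newton step solves with the full Jacobian, while gradient descent replaces Df(x_n)^{-1} with a scalar a.

```python
import numpy as np

# Toy problem: solve f(x) = 0 for f(x) = A x - b, i.e. the gradient
# of the quadratic cost 0.5 x^T A x - b^T x, so the Jacobian Df(x)
# is the constant (symmetric positive definite) matrix A.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])

def f(x):
    return A @ x - b

x_newton = np.zeros(2)
x_gd = np.zeros(2)
for _ in range(50):
    # Newton: solve Df(x_n) z = f(x_n) rather than forming the inverse
    x_newton = x_newton - np.linalg.solve(A, f(x_newton))
    # Gradient descent: replace Df(x_n)^{-1} with a scalar step a
    x_gd = x_gd - 0.2 * f(x_gd)
```

On this quadratic, Newton lands on the solution of A x = b in a single step; gradient descent creeps toward it over many iterations.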
There are probably a hundred. Many people have asked for research into this for, I don't know, 20 years. But the hype doesn't die down, and modern trends have mostly involved doing more GD with bigger models...
A colleague and I experimented with some alternatives and passed notes back and forth once in a while... For certain modelling problems you can save literal GPU days by going against the grain, and get better results.
Gradient descent is great because it is first order and thus cheap to compute (you don't need a Hessian), it points in a descent direction, it works with anything that is differentiable or piecewise differentiable without caveats, and given the millions of parameters in today's models, there is always a descent direction.
If you use population methods, you suffer in terms of memory because you need to keep all those candidates, evaluate them individually, and then update them. That overhead bounds the number of parameters you can afford.
More memory means more parameters, which means you enter the interpolation regime, which means your model behaves well in the real world.
If you try to go with second-order optimisation methods, then you need Hessian-free methods, as computing the full Hessian is computationally intractable for large NNs.
You can attempt to build a local model of the loss landscape but that is expensive and has many caveats.
Nocedal and Wright's Numerical Optimization is a recommended read for numerical optimisation theory.
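One standard Hessian-free ingredient is the Hessian-vector product, which needs only gradient evaluations and never materializes the k x k matrix. A rough finite-difference sketch (the toy quadratic below is my own example; real implementations typically get the product via automatic differentiation instead):

```python
import numpy as np

def hessian_vector_product(grad, x, v, eps=1e-5):
    # H(x) @ v ~= (grad(x + eps*v) - grad(x - eps*v)) / (2*eps):
    # two gradient evaluations, and the full Hessian is never formed.
    return (grad(x + eps * v) - grad(x - eps * v)) / (2 * eps)

# Sanity check: c(x) = 0.5 x^T A x has gradient A x and Hessian A,
# so the product should come out close to A @ v.
A = np.array([[2.0, 0.5], [0.5, 1.0]])
grad = lambda x: A @ x
v = np.array([1.0, -1.0])
hv = hessian_vector_product(grad, np.zeros(2), v)
```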
If accuracy isn't as important as speed for you, you can sample 1000 points and pick the one with the smallest value. That point will probably beat 99% of the function's values, but I wouldn't recommend deploying this in production.
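That random-sampling baseline fits in a few lines (the test function, domain, and seed are arbitrary choices of mine):

```python
import random

def f(x):
    return (x - 0.3) ** 2  # stand-in for an expensive black-box function

random.seed(1)
samples = [random.random() for _ in range(1000)]
best = min(samples, key=f)
# With n uniform samples, the best one beats roughly a 1 - 1/n
# fraction of the function's values over the domain, but there is
# no guarantee of getting close to the true minimizer quickly.
```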
Not to sound snarky, but is there any example where theoretical computer science has actually been useful? The results are often like "Here is a proof this will be slow if you scale it up enough" -- "Yeah we knew that for 20 years, it doesn't matter"
Anything at scale. Anything that uses a database. Anything that uses anything smarter than bubble sort to organize results. Anything that uses a tree (or a trie) internally to represent the data and impose some sort of order on it. CS isn't just about "hey, this thing is slower than that thing"; it's also about making things possible in the first place. If you've used Google (or Facebook or Microsoft or Netflix), they're only possible due to deep computer science work. Which videos get chosen to be stored on the edge cache? A computer science algorithm.
I liked a few results. For example, smoothed analysis by Spielman showed why so many algorithms work very well in practice even though their worst-case complexity is terrible, e.g. the simplex method. Second, concave-convex optimization shows that you definitely get closer to the global optimum at each iteration. I guess it's comforting to know a reason algorithms behave the way they do.
The sampling theorem is also a case in point. So is compressed sensing. Information theory will tell you there is no point trying any harder at compression once you are within epsilon of the theoretical limit.
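For the compression point, that theoretical limit is just the source entropy; a minimal sketch for a biased binary source (the probability 0.1 is a hypothetical example of mine):

```python
import math

def entropy_bits(p):
    # Shannon entropy of a Bernoulli(p) source in bits per symbol.
    # No lossless code can average fewer bits than this, so once a
    # compressor is within epsilon of it, further effort is pointless.
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

h = entropy_bits(0.1)  # a stream of 10% ones needs ~0.47 bits/symbol
```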
> is there any example where theoretical computer science has actually been useful?
Anything requiring extremely optimized algorithms will generally start in academia and work its way outward as the techniques are adopted in commercial software. Sorting and compression algorithms are two examples that come to mind. Even games development, with its non-academic roots, now borrows heavily from academic research to refine things like culling algorithms or optimized computational methods for specialized physics calculations.
https://en.wikipedia.org/wiki/Shaken,_not_stirred
And here are the complexity classes that are mentioned in the abstract: https://complexityzoo.net/Complexity_Zoo:P#ppad and https://complexityzoo.net/Complexity_Zoo:P#pls.
Because that's just an iterative/recursive version of binary/ternary search.
https://en.m.wikipedia.org/wiki/Bisection_method
What was that game that took forever to start up because the code for parsing the config files was accidentally quadratic or exponential?
Edit: also, the standard library of any programming language tends to have quite a bit of theoretical computer science inside.
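The classic way parsing goes accidentally quadratic is rescanning or recopying the remaining input on every token, which turns a single pass into O(n^2). An illustrative sketch (not any actual game's code):

```python
def parse_quadratic(text):
    # partition() copies the remainder of the string on every
    # iteration, so total work is O(n^2) in the input length.
    tokens = []
    while text:
        token, _, text = text.partition(",")
        tokens.append(token)
    return tokens

def parse_linear(text):
    return text.split(",")  # single pass over the input, O(n)
```

Both produce the same tokens; only the asymptotics differ, which is exactly why the bug hides until the config file gets large.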