> Brouwer’s fixed point theorem. The theorem says that for any continuous function, there is guaranteed to be one point that the function leaves unchanged — a fixed point, as it’s known. This is true in daily life. If you stir a glass of water, the theorem guarantees that there absolutely must be one particle of water that will end up in the same place it started from.
Wait. What if by stirring I shook the glass and at the end set it back down a foot away from its initial location? Is this author claiming that Brouwer’s theorem guarantees a particle floating somewhere where the glass used to be? Unchanged?
I can imagine an unlimited number of scenarios where I can guarantee the water particles are displaced.
This seems like a bad application of the theorem, right?
A better example I've seen for the theorem: if you take a paper map of a country, messily scrunch it up into a ball, and drop it somewhere in the country, there will be at least one point on the map lying exactly above the real point it represents.
As others have said, it's meant for mappings of the space to itself. So stirring the water, but not moving the glass.
But anyway, the theorem works only for continuous mappings. The moment they started mentioning "water particles", instead of some hypothetical fluid that forms a continuous block, the theorem no longer applied. You could break it by mirroring the position of every particle. There's still a fixed set (the line of reflection), but there's no obligation that a particle sits on that line.
It only applies when you're mapping a space onto itself (plus a couple other qualifications), so you have to put the glass back in the same place.
I think of it as a generalization of the intermediate value theorem - some things are going left, some things are going right, one thing must be sitting still in between.
They left off some qualifiers. According to wikipedia, it's applicable to a continuous function mapping a nonempty compact convex set to itself. Which ends up making a lot more sense to me.
The theorem assumes the function has identical domain and codomain. E.g. f(x) = x^3 maps [0, 1] to itself.
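In one dimension this follows from the intermediate value theorem, and you can even compute a fixed point by bisection. A minimal sketch (the `fixed_point` helper and its tolerance are my own illustrative choices, not anything from the article):

```python
import math

def fixed_point(f, lo=0.0, hi=1.0, tol=1e-10):
    # Brouwer in 1-D: a continuous f: [lo, hi] -> [lo, hi] gives
    # g(x) = f(x) - x with g(lo) >= 0 and g(hi) <= 0, so the
    # intermediate value theorem guarantees a root of g,
    # i.e. a fixed point of f. Bisect on the sign of g.
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if f(mid) - mid >= 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

x_cube = fixed_point(lambda t: t ** 3)  # x^3 = x on [0, 1]
x_cos = fixed_point(math.cos)           # cos(x) = x, the "Dottie number"
```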
So that's why they gave the glass stirring example, the domain here is the whole volume of water. So as long as it ends up in the same volume of water, the process of stirring is assumed to be continuous and therefore this theorem applies. Your example changes where it ends up (ie not back to the same domain) and so it cannot be applied.
I'm misunderstanding the theorem. Take f(x) = x + 1. That is a continuous function, but there does not exist an x where f(x) = x holds. What am I missing?
> If you stir a glass of water, the theorem guarantees that there absolutely must be one particle of water that will end up in the same place it started from.
Wait what? If you stir a glass long enough, any configuration of particles should be possible. For instance you can imagine "cutting" the water like a deck of cards.
The full theorem is about a continuous mapping of a space back onto itself. So, only the water in the glass mapped back to the same glass. Pour it out into another glass, and you're no longer mapping back to the original glass. Pour it back into the original glass, and a fixed point is possible again. And, obviously, only with respect to the mapping within the glass.
For problems that require recognition of patterns, whether in images or sounds, a non-linear function represented by a neural network and trained through gradient descent suffices to produce smoothly distributed fuzzy classification that just works.
This setup is not a good prior for anything that requires a more precise solution.
What would you search for, the zero/root of the derivative? Imagine the derivative is a function like y = x^2 - eps for some small eps, say 0.000001. For a binary search like that to work, you need to start with two values x1 and x2 such that sign(y1) != sign(y2). But with the derivative above, those values are hard to find with random initialization.
More generally, for binary search to find a value x such that f(x)=0, we need to have f(any value less than x)<0 and f(any value greater than x)>0. We can't guarantee these properties for general derivatives.
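To make the bracketing problem concrete, here is a quick numerical sketch (the interval [-1, 1], the seed, and the trial count are arbitrary choices of mine): with eps = 1e-6, the function g(x) = x^2 - eps is negative only on the tiny interval (-0.001, 0.001), so randomly drawn starting points almost never produce the sign change bisection requires.

```python
import random

def g(x):
    # A derivative with roots at +-sqrt(eps): it is negative only
    # on the narrow interval (-0.001, 0.001).
    return x * x - 1e-6

random.seed(0)
trials = 10_000
# A bisection bracket [x1, x2] needs sign(g(x1)) != sign(g(x2)).
# Since g > 0 almost everywhere on [-1, 1], that requires at least
# one of the two random endpoints to land inside (-0.001, 0.001).
brackets = sum(
    1
    for _ in range(trials)
    if g(random.uniform(-1, 1)) < 0 or g(random.uniform(-1, 1)) < 0
)
# brackets is typically on the order of tens out of 10,000 trials
```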
This site (Quanta Magazine) had really excellent content when it first launched, like the article in the OP; now it has become a sort of IFLScience in terms of editorializing.
Start from the Taylor expansion f(x) ≈ f(x_n) + Df(x_n)(x - x_n) + O(|x - x_n|^2), drop the quadratic term, equate to 0, and you get an approximate solution x for the next iterate x_{n+1}:
x_{n+1} = x_n - Df(x_n)^{-1} * f(x_n)
As written, this is Newton's method. The problem is that the Jacobian Df(x_n) is a k x k matrix, and inverting it may require O(k^3) work.
So, pretty much all schemes are based on approximations of the Jacobian.
Gradient descent for example replaces Df(x_n)^{-1} with a scalar a, so it's O(1) work to "compute" it:
x_{n+1} = x_n - a * f(x_n)
A method like L-BFGS tries to build a relatively cheap approximation to the inverse of the Jacobian during iteration, resulting in O(k) work to compute:
x_{n+1} = x_n - P * f(x_n)
Other methods may exploit sparsity of the Jacobian, or solve the linear system Df(x_n)z = f(x_n) only approximately for z using e.g. iterative methods.
Note: in optimization problems, the function f is typically the gradient of a cost function, and the Jacobian is then the Hessian of that cost function.
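The two extremes above can be compared on a toy problem (the quadratic cost, dimensions, and step size here are my own illustrative choices): a Newton step solves with the full Jacobian, while gradient descent replaces Df(x_n)^{-1} with a scalar a.

```python
import numpy as np

# Toy problem: solve f(x) = 0 for f(x) = A x - b, i.e. the gradient
# of the quadratic cost 0.5 x^T A x - b^T x, so the Jacobian Df(x)
# is the constant (symmetric positive definite) matrix A.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])

def f(x):
    return A @ x - b

x_newton = np.zeros(2)
x_gd = np.zeros(2)
for _ in range(50):
    # Newton: solve Df(x_n) z = f(x_n) rather than forming the inverse
    x_newton = x_newton - np.linalg.solve(A, f(x_newton))
    # Gradient descent: replace Df(x_n)^{-1} with a scalar step a
    x_gd = x_gd - 0.2 * f(x_gd)
```

On this quadratic, Newton lands on the solution of A x = b in a single step; gradient descent creeps toward it over many iterations.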
There are probably a hundred. Many people have asked for research into this for, I don't know, 20 years. But the hype doesn't die down, and modern trends have mostly involved doing more GD with bigger models...
A colleague and I experimented with some alternatives and passed notes back and forth once in a while... For certain modelling problems you can save literal GPU days by going against the grain, and get better results.
Gradient descent is great because it is first order and thus cheap to compute (you don't need a Hessian), it points in a descent direction, it works with anything that is differentiable or piecewise differentiable without caveats, and given the millions of parameters in today's models, there is always a descent direction.
If you use population methods, you suffer in terms of memory because you need to keep all those candidates, evaluate them individually, and then update them. That overhead bounds the number of parameters you can afford.
More memory means more parameters, which means you enter the interpolation regime, which means your model behaves well in the real world.
If you try to go with second-order optimisation methods, then you need Hessian-free methods, as computing the full Hessian is computationally intractable for large NNs.
You can attempt to build a local model of the loss landscape but that is expensive and has many caveats.
Nocedal and Wright's Numerical Optimization is a recommended read for numerical optimisation theory.
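One standard Hessian-free ingredient is the Hessian-vector product, which needs only gradient evaluations and never materializes the k x k matrix. A rough finite-difference sketch (the toy quadratic below is my own example; real implementations typically get the product via automatic differentiation instead):

```python
import numpy as np

def hessian_vector_product(grad, x, v, eps=1e-5):
    # H(x) @ v ~= (grad(x + eps*v) - grad(x - eps*v)) / (2*eps):
    # two gradient evaluations, and the full Hessian is never formed.
    return (grad(x + eps * v) - grad(x - eps * v)) / (2 * eps)

# Sanity check: c(x) = 0.5 x^T A x has gradient A x and Hessian A,
# so the product should come out close to A @ v.
A = np.array([[2.0, 0.5], [0.5, 1.0]])
grad = lambda x: A @ x
v = np.array([1.0, -1.0])
hv = hessian_vector_product(grad, np.zeros(2), v)
```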
If accuracy isn't as important as speed for you, you can sample 1000 points and pick the one with the smallest value. That point will probably beat 99% of the function's values, but I wouldn't recommend deploying this in production.
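That random-sampling baseline fits in a few lines (the test function, domain, and seed are arbitrary choices of mine):

```python
import random

def f(x):
    return (x - 0.3) ** 2  # stand-in for an expensive black-box function

random.seed(1)
samples = [random.random() for _ in range(1000)]
best = min(samples, key=f)
# With n uniform samples, the best one beats roughly a 1 - 1/n
# fraction of the function's values over the domain, but there is
# no guarantee of getting close to the true minimizer quickly.
```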
Not to sound snarky, but is there any example where theoretical computer science has actually been useful? The results are often like "Here is a proof this will be slow if you scale it up enough" -- "Yeah we knew that for 20 years, it doesn't matter"
Anything at scale. Anything that uses a database. Anything that uses anything smarter than bubble sort to organize results. Anything that uses a tree (or a trie) internally to represent the data and impose some sort of order on it. CS isn't just about "hey, this thing is slower than that thing"; it's also about making things possible in the first place. If you've used Google (or Facebook or Microsoft or Netflix), they're only possible due to deep computer science work. Which videos get chosen to be stored on the edge cache? A computer science algorithm.
I liked a few results. For example, smoothed analysis by Spielman showed why so many algorithms work very well in practice even though their worst-case complexity is terrible, e.g. the simplex method. Second, concave-convex optimization shows that you definitely get closer to the global optimum at each iteration. I guess it's comforting to know a reason algorithms behave the way they do.
The sampling theorem is also a case in point. So is compressed sensing. Information theory will tell you there is no point trying any harder at compression once you are within epsilon of the theoretical limit.
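For the compression point, that theoretical limit is just the source entropy; a minimal sketch for a biased binary source (the probability 0.1 is a hypothetical example of mine):

```python
import math

def entropy_bits(p):
    # Shannon entropy of a Bernoulli(p) source in bits per symbol.
    # No lossless code can average fewer bits than this, so once a
    # compressor is within epsilon of it, further effort is pointless.
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

h = entropy_bits(0.1)  # a stream of 10% ones needs ~0.47 bits/symbol
```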
> is there any example where theoretical computer science has actually been useful?
Anything requiring extremely optimized algorithms will generally start in academia and work its way outward as the techniques are adopted in commercial software. Sorting and compression algorithms are two examples that come to mind. Even games development, with its non-academic roots, now borrows heavily from academic research to refine things like culling algorithms or optimized computational methods for specialized physics calculations.
https://en.wikipedia.org/wiki/Shaken,_not_stirred
And here are the complexity classes that are mentioned in the abstract: https://complexityzoo.net/Complexity_Zoo:P#ppad and https://complexityzoo.net/Complexity_Zoo:P#pls.
Because that's just an iterative/recursive version of binary/ternary search.
https://en.m.wikipedia.org/wiki/Bisection_method
What was that game that took forever to start up because the code for parsing the config files was accidentally quadratic or exponential?
Edit: also, the standard library of any programming language tends to have quite a bit of theoretical computer science inside.
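The classic way parsing goes accidentally quadratic is rescanning or recopying the remaining input on every token, which turns a single pass into O(n^2). An illustrative sketch (not any actual game's code):

```python
def parse_quadratic(text):
    # partition() copies the remainder of the string on every
    # iteration, so total work is O(n^2) in the input length.
    tokens = []
    while text:
        token, _, text = text.partition(",")
        tokens.append(token)
    return tokens

def parse_linear(text):
    return text.split(",")  # single pass over the input, O(n)
```

Both produce the same tokens; only the asymptotics differ, which is exactly why the bug hides until the config file gets large.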