
A visual proof that neural nets can compute any function

217 points | jbarrow | 11 years ago | neuralnetworksanddeeplearning.com | reply

80 comments

[+] Tloewald|11 years ago|reply
Can't the same be said for Fourier series, which make no claims to be some kind of AI? And likewise humble polynomials:

http://en.wikipedia.org/wiki/Stone%E2%80%93Weierstrass_theor...
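The Stone–Weierstrass point is easy to demonstrate numerically: on a compact interval, plain polynomials get uniformly close to a continuous function, no neurons required. A minimal sketch (the target function and degrees are arbitrary choices for illustration):

```python
import numpy as np

# Stone-Weierstrass in action: fit polynomials of increasing degree to a
# smooth function on [0, 1] and watch the worst-case (sup-norm) error shrink.
xs = np.linspace(0.0, 1.0, 400)
target = np.exp(np.sin(3 * xs))  # an arbitrary continuous function

for degree in (2, 5, 10):
    coeffs = np.polyfit(xs, target, degree)  # least-squares polynomial fit
    err = np.max(np.abs(np.polyval(coeffs, xs) - target))
    print(degree, err)  # sup error shrinks as the degree grows
```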

[+] mtdewcmu|11 years ago|reply
Yes (I'm not an expert in neural nets, but that appears to be exactly what this is saying). If you look at what goes into a neural net and compare it to what goes into a Fourier transform, it should be obvious that neural nets have even more machinery than they actually need for this task.
[+] mooneater|11 years ago|reply
Who cares if they can compute any function. The important question is, can they learn any function, and can they learn in a way that can generalize? (And clearly they can for many useful domains).
[+] wuliwong|11 years ago|reply
This is a chapter from a textbook, one that seems pretty introductory at that. I'd say it is a pretty important piece of information to convey and prove, this ability of neural networks to represent any function.

Seems to me that jumping to whether they can "learn any function", as you ask, would come only after this is first established.

[+] akuma73|11 years ago|reply
What about XOR?

From Wikipedia: In 1969 in a famous monograph entitled Perceptrons, Marvin Minsky and Seymour Papert showed that it was impossible for a single-layer perceptron network to learn an XOR function.
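The Minsky–Papert limitation is specifically about single-layer perceptrons; one hidden layer already handles XOR. A minimal sketch with hand-picked (not learned) weights:

```python
import numpy as np

# XOR's truth table is not linearly separable, so a single-layer
# perceptron (one weighted sum + threshold) cannot represent it.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])

def step(z):
    return (z > 0).astype(int)

# One hidden layer with hand-picked weights solves it:
# hidden unit 1 computes OR, hidden unit 2 computes AND,
# and the output computes (OR) AND (NOT AND) = XOR.
W_hidden = np.array([[1.0, 1.0], [1.0, 1.0]])
b_hidden = np.array([-0.5, -1.5])
w_out, b_out = np.array([1.0, -2.0]), -0.5

def xor_net(x):
    h = step(x @ W_hidden + b_hidden)
    return int(step(h @ w_out + b_out))

print([xor_net(x) for x in X])  # [0, 1, 1, 0]
```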

[+] samizdatum|11 years ago|reply
The biologically-inspired nature of neural nets was a steady, mesmerizing flame for machine learning research. A chance to work on an extremely simple, provably universal system that had embarrassingly obvious rhetorical implications for profound-sounding problems like the computability of consciousness, or the computational complexity of the human brain, proved too seductive for the legions of research-moths that flocked to bask in that alluring flame.

The initial flurry of research activity slowed to a background simmer, however. As a steady march of implementation problems was knocked down, it became clear that the limitations of neural nets were inherent to neural nets themselves, rather than any particular implementation detail. For example, the Lagrangians calculated by SVMs are provably convex, in contrast to the lack of a convexity guarantee for neural nets, which happily gradient descent into local minima. There were workarounds, of course, with regularization techniques, stochastic descent, etc. providing some measure of relief, but a migration of research interest away from neural nets seemed increasingly promising, and today, the migration seems largely complete.

[+] sotte|11 years ago|reply
If you're interested in the generalization ability of neural networks I can recommend the following paper: "Intriguing properties of neural networks" http://arxiv.org/pdf/1312.6199v4.pdf

TLDR: The authors create adversarial examples, i.e., they slightly modify the original images so that they look exactly the same to humans, but the neural networks can no longer classify them correctly. What does that imply about the ability to generalize? :)

On a more general note: NNs are often treated as something magical. I think a "sober view" is that NNs are a particular way to parameterize non-linear functions. This is neither good nor bad, and it's easy to see when you look at, for example, a 2-layer NN:

$f(x) = w_2^T \sigma(W_1 \sigma (W_0 x))$
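That parameterization is short enough to write out directly. A minimal numpy sketch, with arbitrary dimensions and random weights purely for illustration:

```python
import numpy as np

# The composition above, made concrete: just matrix products and an
# elementwise nonlinearity, nothing magical.
sigma = lambda z: 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
W0 = rng.standard_normal((5, 3))   # input dim 3 -> hidden dim 5
W1 = rng.standard_normal((4, 5))   # hidden dim 5 -> hidden dim 4
w2 = rng.standard_normal(4)        # hidden dim 4 -> scalar output

def f(x):
    return w2 @ sigma(W1 @ sigma(W0 @ x))

print(f(np.array([1.0, -2.0, 0.5])))  # a scalar; "training" just adjusts the W's
```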

[+] mjw|11 years ago|reply
Quite. It's not hard to come up with models or families of functions which share this property.

What matters is not only whether they can learn it but how much data they need to learn it to a given degree of accuracy. This is the kind of question addressed by nonparametric statistics and statistical learning theory.

[+] mostly_harmless|11 years ago|reply
I think this ability to compute any function is important, and useful in understanding why neural nets work. Neural nets are essentially a blank computation matrix that, through training, best approximates a function.

Comparing neural net neurons to logic gates in a computer circuit or FPGA gives a nice intuitive understanding of why the training works in the first place. The training essentially sets up a custom virtual circuit for whatever function you need.

Given this property, it seems intuitive that they would be able to learn any function, since any function is representable as logic gates.
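For a concrete version of the gate analogy: a single sigmoid neuron can approximate NAND (the -2/-2/+3 weights below follow the standard textbook construction; the scale factor just sharpens the sigmoid), and NAND alone is universal for boolean circuits:

```python
import numpy as np

sigma = lambda z: 1.0 / (1.0 + np.exp(-z))

# One sigmoid neuron with weights -2, -2 and bias +3 approximates NAND;
# multiplying everything by a large scale pushes the output toward 0/1.
def nand(a, b, scale=10.0):
    return sigma(scale * (3 - 2 * a - 2 * b))

for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(a, b, round(float(nand(a, b))))  # 1, 1, 1, 0 -- the NAND truth table
```

Since any boolean function can be built from NAND gates alone, networks of such neurons inherit that universality.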

[+] Houshalter|11 years ago|reply
Sort of. It depends on the learning algorithm used. Typically some form of local optimization is used that gets to a good local optimum. This is usually good enough, and fast. However, you can use global optimization methods, or whatever optimization algorithm you want.

NNs themselves are sort of agnostic to the algorithm you use to set their weights. It's sort of like saying "computers are Turing complete, but can we ever prove that human programmers are smart enough to program any function."

[+] skarayan|11 years ago|reply
It computes because it is detached, there is no feeling. To learn, we would need to first create something that is conscious, an ever-changing self. It needs to be in reference to itself, not some notion of compute in a detached paradigm.
[+] tomp|11 years ago|reply
As mentioned in the article, the formal statement is actually "neural nets can approximate (arbitrarily well, using the supremum metric) any continuous function". For other norms, it can also approximate non-continuous functions.
[+] mturmon|11 years ago|reply
You also need the qualifier "...any continuous function, on a compact set."

Once you add all three qualifiers in (approximate/continuous/compact) it starts to sound more like math and less like a miracle.
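With all three qualifiers in place, the construction can even be sketched numerically: pairs of steep sigmoids form near-rectangular "bumps", and a weighted sum of bumps tracks a continuous target on a compact interval. A toy illustration (all parameters arbitrary):

```python
import numpy as np

# Clip the argument to avoid exp overflow at extreme values.
sigma = lambda z: 1.0 / (1.0 + np.exp(-np.clip(z, -60, 60)))
target = lambda x: np.sin(2 * np.pi * x)  # a continuous function on compact [0, 1]

# One hidden layer: each pair of steep sigmoids makes a near-step "bump",
# and the output weights set each bump's height to the target's value there.
n_bumps, k = 50, 1000.0  # k controls steepness
edges = np.linspace(0.0, 1.0, n_bumps + 1)

def approx(x):
    heights = target((edges[:-1] + edges[1:]) / 2)  # target at bump centers
    return sum(h * (sigma(k * (x - a)) - sigma(k * (x - b)))
               for h, a, b in zip(heights, edges[:-1], edges[1:]))

xs = np.linspace(0.05, 0.95, 200)
print(np.max(np.abs(approx(xs) - target(xs))))  # sup error shrinks as n_bumps grows
```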

Incidentally, one thing of great interest is, how does the number of hidden units required behave as a function of dimensionality of the input domain. In dramatic language, "Can neural networks get around the curse of dimensionality?"

The Cybenko proof does not give enlightenment about that question. Andrew Barron (http://www.stat.yale.edu/~arb4/) had some results that seemed to indicate the dependence was moderate (not exponential). I'm not aware what the state of the art currently is.

[+] voidlogic|11 years ago|reply
It would have been pretty interesting if this had NOT held. It would have meant that even though "neural nets can NOT approximate (arbitrarily well, using the supremum metric) any continuous function", a neural network (the humans involved) was able to discover this limitation.

I find the idea of a neural net finding a limitation of neural nets interesting.

[+] gphil|11 years ago|reply
Yep, my first thought upon seeing the title was: wait, what about non-computable functions?
[+] numlocked|11 years ago|reply
This is also why neural nets are susceptible to overfitting and fell out of vogue in the 90s :) They will merrily fit themselves, very precisely, to your noisy, wiggly data. Obviously there are ways to combat this, but it seems like a corollary to their 'universality'.
[+] sz4kerto|11 years ago|reply
Feedforward NNs are only useful for a very narrow* set of problems (* -> compared to 'all' problems out there).

Recurrent networks are needed for stateful operation, i.e. where some kind of memory is needed (any case where the input is spread across time or the sequence of the data matters). And learning in recurrent nets is unfortunately still in its very early stages.

[+] Houshalter|11 years ago|reply
They are universal function approximators, which means they can map any set of input values to any set of output values. Of course, doing this sometimes requires rote memorization of every possible input and its output, rather than generalizing the function with a few parameters.

Adding more layers improves on this and allows you to make functions that compose multiple smaller functions. The problem with this is the nonlinearities cause the gradients to explode or vanish after a few layers. So the amount of computing power required to train them is huge.
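The vanishing-gradient half of that is easy to see numerically: the sigmoid's derivative is at most 1/4, so backpropagation through sigmoid layers multiplies in one small factor per layer. A toy sketch with scalar "layers" and an arbitrary weight:

```python
import numpy as np

sigma = lambda z: 1.0 / (1.0 + np.exp(-z))

# Chain rule through a stack of scalar sigmoid layers: each layer
# contributes a factor w * sigma'(z), and sigma'(z) <= 0.25, so the
# gradient shrinks geometrically with depth for modest weights.
def grad_through_layers(depth, w=1.0, z=0.5):
    g = 1.0
    for _ in range(depth):
        s = sigma(z)
        g *= w * s * (1 - s)  # one chain-rule factor per layer
        z = w * s             # activation feeding the next layer
    return g

for depth in (1, 5, 10, 20):
    print(depth, grad_through_layers(depth))  # shrinks geometrically
```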

Recurrent NNs had the same problem, since they are equivalent to a very deep feedforward network where every layer is a time step and the weights between layers are shared.

But the invention of Long Short Term Memory has made training RNNs practical. Basically, as I understand it, some connections do not use nonlinearities so the gradients don't explode or vanish.

[+] mostly_harmless|11 years ago|reply
I wrote about this a few weeks ago:

http://serialprog.blogspot.ca/2014/07/neural-networks-like-c...

My point of view was that neurons in neural nets are essentially analogue logic gates. Given that combinations of logic gates are Turing complete, combinations of neural net neurons should be also.

My writing is not quite as nice or rigorous as the parent post, but the whole point of the blog is to get better at self-expression and explaining things.

[+] bmease|11 years ago|reply
I loved reading that and interacting with the plots. It totally changes the dynamic of learning when you can interact and play with it in real time.
[+] theophrastus|11 years ago|reply
If neural nets can compute any function (as seems neatly proven here), can they compute a given function in more than one way? If so, then upon applying novel input (which was our goal following training), how can we know that the particular way arrived at via the training set is 'right' for our novel input? If all this is true, it would seem to make neural nets perfectly unreliable as a means of modeling..?
[+] Dn_Ab|11 years ago|reply
This is a wonderful post. One minor aspect which nags me is that when I read "any function", I think any "effectively calculable method". But regular feedforward MLPs are not Turing Complete (will you be going over recurrent or recursive networks?). If so, it would be useful to note this distinction as I've never seen that point for confusion dealt with properly in one place.
[+] coherentpony|11 years ago|reply
Computing a function and approximating a function are two different disciplines.

When approximating a function, one must also talk about the sense in which the approximation is being made. That is, L^2? H^1? Pointwise?

[+] pesenti|11 years ago|reply
Does anybody know if this is true for other machine learning techniques?
[+] mturmon|11 years ago|reply
The same arguments as in the original Cybenko paper, or the Stone-Weierstrass theorem, lend support to the idea that SVMs are universal approximators (with most typical kernels). This has been proven by a couple of authors.

I'm not aware of universal approximation results for random forests, but since they have the same general construction, this would not be surprising.

[+] jokoon|11 years ago|reply
who's to doubt that neural networks are completely awesome?

Any more news about those chips that were optimized for neural networks? Was it IBM or Samsung?

[+] j2kun|11 years ago|reply
We have very few theoretical results about neural networks. So not completely awesome.
[+] signa11|11 years ago|reply
may i also humbly suggest the two vol. series called "parallel distributed processing" (rumelhart et al) which provides an excellent overview of early NN research.
[+] mikeash|11 years ago|reply
I liked this article a lot, but I found it extremely confusing how some of the diagrams were interactive and some weren't. Why not make them all interactive? Barring that, an obvious visual indicator when one is interactive would be handy. As it was, I clicked on a lot of static images.