This is a very interesting perspective on neural networks. Is it novel? I've never seen a geometric interpretation of NNs before.
In particular this:
> (Apparently determining if knots are trivial is NP. This doesn’t bode well for neural networks.)
Is there any theoretical research on obstructions to NN learning? Not that it would change the practice much (for instance, MLE learning of Gaussian mixtures is NP-hard, but everybody does it anyway), but it could shed some light on the geometry of the "hard" instances.
EDIT: for example, to get a sense of how ill-conditioned deep NN learning is, a recent paper [1] shows that if we feed an object-classifying NN an image with crafted but unnoticeable perturbations, the predictions change completely.
[1] http://arxiv.org/abs/1312.6199
There is a large body of work from the mid-1980s and early 1990s that addresses the questions of hardness, sensitivity and robustness, arising from various statistical physics/computer science collaborations. Of course, it is still ongoing, but that was a big period. The basic thrust of this program is to take methods from statistical mechanics and use them to calculate not just worst-case complexity, but also average case, best case, and everything in between. It is hard to think of a reference to end all references, but as far as books go you might want to check out: http://www.amazon.com/The-Nature-Computation-Cristopher-Moor.... For an article that briefly addresses some results related to kSAT there is: http://www.cs.cornell.edu/selman/papers/pdf/99.nature.phase.....
I have seen Riemannian geometry applied to neural networks. Basically, the state space is curved, so you can speed up learning if you take the curvature into account when sampling in the space of possible neural nets. (You want to explore the hypothesis space evenly.)
Information geometry for neural networks, Daniel Wagenaar, 1998:
http://www.danielwagenaar.net/res/papers/98-Wage2.pdf
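The curvature-aware idea above is the essence of the natural gradient: precondition the ordinary gradient by the inverse Fisher information matrix, so each step is "the same size" in the model's own geometry. A minimal sketch, assuming the Fisher matrix has already been estimated somehow (all names here are illustrative):

```python
import numpy as np

def natural_gradient_step(theta, grad, fisher, lr=0.1, damping=1e-4):
    # Precondition the Euclidean gradient by the (damped) inverse Fisher
    # matrix; damping keeps the solve well-posed when the Fisher matrix
    # is nearly singular.
    F = fisher + damping * np.eye(len(theta))
    return theta - lr * np.linalg.solve(F, grad)
```

With `fisher` equal to the identity this reduces to ordinary gradient descent; a curved parameter space shrinks or stretches the step along the corresponding directions.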
I think the author has confused NP with NP-hard (or NP-complete). Indeed, it is extremely unlikely that the unknottedness problem is NP-complete, as it is already in NP ∩ co-NP (assuming the generalized Riemann hypothesis). Furthermore, we know that the problem of computing many knot invariants is "only" BQP-complete, so it's possible that unknottedness is in BQP (i.e., efficiently solvable by quantum computers), though I don't know the expert opinion on this.
I think machine learning boils down to 2 fundamental issues:
1) Having a model class that is powerful enough to represent the data.
and
2) Being able to find a "good enough" solution when optimizing over that model class.
In principle, over-fitting can be addressed through the choice of the optimization criterion in (2).
Neural networks are used because they are powerful enough to represent many data sets of interest and because there exist good algorithms for finding LOCAL optima.
Unfortunately they are complex, non-linear, non-convex functions, and hence finding a global optimum is most likely NP-hard.
We are left then with the heuristic of hoping that the local optima we are able to find are "close enough" to the global optimum to meet our needs.
This seems to work reasonably well for certain types of problems, like image classification, where the underlying data possess a certain simplicity or "smoothness", but less well for other problems, like logic problems, where good and bad solutions may not be "nearby" in any sense that the local optimization algorithms are able to discover.
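The local-vs-global distinction is easy to see even in one dimension. A toy illustration (the function here is made up purely for demonstration): gradient descent on a non-convex function converges to whichever local minimum is nearest its starting point.

```python
import numpy as np

# A simple non-convex function with two local minima of different depths:
# a global minimum near x = -1.30 and a shallower one near x = 1.13.
def f(x):
    return x**4 - 3 * x**2 + x

def grad_f(x):
    return 4 * x**3 - 6 * x + 1

def gradient_descent(x0, lr=0.01, steps=2000):
    x = x0
    for _ in range(steps):
        x -= lr * grad_f(x)
    return x

# Different initializations land in different basins.
left = gradient_descent(-2.0)   # finds the global minimum
right = gradient_descent(2.0)   # gets stuck in the shallower minimum
```

Nothing in the descent procedure itself tells you which basin you ended up in; that is exactly the "hope it's close enough" heuristic described above.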
Bloody brilliant. I have longed for an ML methodology that breaks out of the float-vector representation, because I don't think it describes naturally occurring problems that well; most ontologies are semi-structured. However, this really gets to the nub of the problem: real, but solvable, problems are manifold complexes. I don't think anyone has really built a learner for complexes, though; persistent homology is a stab in the right direction, but it's not there yet, IMHO.
I don't like his proof that a 1-2 hidden unit system can't handle the red circle in the blue torus. Feels too handwavy.
My proof: Suppose the matrix W is singular. Then there exists a nonzero vector v for which Wv = 0. Take a point x in the red circle, then find a different point y = x + av (for some scalar a) in the blue torus; such a y exists because the line through x in the direction v must cross the torus. Then Wy + b = W(x + av) + b = Wx + aWv + b = Wx + a·0 + b = Wx + b, so the layer maps x and y to the same point and cannot separate them. Fin.
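The algebra is easy to check numerically. A quick sketch (the particular W, v, and points are made up for illustration; v is chosen in W's null space):

```python
import numpy as np

# A singular 2x2 matrix: the second row is twice the first.
W = np.array([[1.0, 2.0],
              [2.0, 4.0]])
b = np.array([0.5, -0.5])

v = np.array([2.0, -1.0])   # null vector: W @ v = 0

x = np.array([0.1, 0.2])    # stand-in for a point in the red circle
y = x + 3.0 * v             # a distinct point along the collapsed direction

# The affine layer maps both points identically, so no later layer
# can ever tell them apart.
same = np.allclose(W @ x + b, W @ y + b)
```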
Wouldn't this only apply to neural networks as used for classification? I mean the general paradigm of deforming curves until they're separated by a hyperplane seems pretty obvious now that I see it in front of me, but what about neural networks used to approximate continuous functions?
I'll take a stab at this (I'm a decade out from my last machine learning class, so no guarantees on correctness). The only reason it's fitting a hyperplane is that one class is being mapped to the continuous value -1.0 and the other class to the continuous value 1.0, with a thresholding step at the end (the hyperplane perpendicular to the line onto which the continuous values are being projected) to determine the class. If you're doing regression instead of classification, your training data will be fed in with more output values than just 1.0 and -1.0, and you'll omit the thresholding at the end; otherwise the behavior and intuition should be the same as in the article.
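In code, the two uses differ only in the last line. A minimal sketch (the weights here are arbitrary placeholders, just to make the shapes concrete):

```python
import numpy as np

def forward(x, W1, b1, W2, b2):
    # One hidden tanh layer: the "deforming" part of the network.
    h = np.tanh(W1 @ x + b1)
    # A single continuous output.
    return W2 @ h + b2

W1 = np.eye(2)
b1 = np.zeros(2)
W2 = np.array([1.0, 1.0])
b2 = 0.0

x = np.array([1.0, 0.0])
score = forward(x, W1, b1, W2, b2)   # regression: use the score directly
label = np.sign(score)               # classification: threshold the score
```

Everything before the final threshold — the sequence of deformations — is identical in both cases.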
The article is exceptionally clear and well presented. We need more of this level of writing about machine learning.
For me, the most important point is the one the article touches on only at the end: if you have good features and a good representation, a linear model may be enough. Real-world data and tasks usually don't require knot untying. In my opinion, the model (e.g., a neural net) should be kept simple, and the challenge is in finding the right features and representation.
Coates & Ng (2012) wrote:
> Many algorithms are available to learn deep hierarchies of features from unlabeled data, especially images. In many cases, these algorithms involve multi-layered networks of features (e.g., neural networks) that are sometimes tricky to train and tune and are difficult to scale up to many machines effectively. Recently, it has been found that K-means clustering can be used as a fast alternative training method. The main advantage of this approach is that it is very fast and easily implemented at large scale. On the other hand, employing this method in practice is not completely trivial: K-means has several limitations, and care must be taken to combine the right ingredients to get the system to work well.
http://www.stanford.edu/~acoates/papers/coatesng_nntot2012.p...
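The idea can be sketched roughly as follows. This is a toy version only: the actual Coates & Ng pipeline learns centroids from whitened image patches and uses a "triangle" activation rather than raw distances.

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    # Plain Lloyd's algorithm: alternate nearest-centroid assignment
    # and centroid recomputation.
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return centroids

def features(X, centroids):
    # Represent each input by its distances to the learned centroids;
    # this vector is then fed to a simple (e.g. linear) classifier.
    return np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
```

The appeal is that both steps are embarrassingly parallel and have no fiddly hyperparameters beyond k itself.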
This is very interesting. I wonder if anyone knows what can be gained, in terms of separating out different topologies, by allowing a restricted family of conformal transformations on top of the tanh functions, of the sort that allow taking the insides out and so forth. Comments?
A bit of an aside: are there any methods for compressing/collapsing/simplifying a neural network?
What I mean is: imagine you've built and trained a neural network. As per the article, it is hard to ascertain exactly what it is doing. I was wondering whether there is work in this area, and it occurred to me that a possible first step would be to collapse the neural network to a simpler but functionally equivalent structure.
I imagine this is far more difficult than it sounds, but I just wondered.
The Universal Approximation Theorem [1] asserts that you only ever need one hidden layer, which at least guarantees that "an (approximate) simplification exists".
But I can't say off the top of my head how you'd collapse an ANN with just two hidden layers into one. It's not obvious how sigmoid functions compose, but I suppose I should give it more thought...
1: http://en.wikipedia.org/wiki/Universal_approximation_theorem
There are pruning methods which try to find and remove connections and neurons that don't have much effect on the network, essentially simplifying it.
There are also some methods, such as HyperNEAT, which try to find simple structures that can be expanded into a much larger neural network, by sharing weights and repeating sections of the network many times.
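One common pruning heuristic is magnitude-based: zero out the smallest-magnitude weights on the assumption that they contribute least to the output. A hypothetical sketch (the pruning fraction is illustrative, and real pruning is usually followed by fine-tuning):

```python
import numpy as np

def prune_by_magnitude(weights, fraction=0.5):
    # Zero out the smallest-magnitude `fraction` of the weights,
    # returning the pruned weights and the binary keep-mask.
    threshold = np.quantile(np.abs(weights).ravel(), fraction)
    mask = np.abs(weights) >= threshold
    return weights * mask, mask
```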
> One could learn the vector field at fixed points (just take some fixed points from the training set to use as anchors) and interpolate in some manner.
One family of interpolations with nice properties (smooth everywhere, invertible, can be constructed for any set of pointwise correspondences) is the diffeomorphisms.
Unfortunately, the mathematical objects are infinite-dimensional, which makes computation with them somewhat involved, even if they are derived from a low-dimensional set of point correspondences. See, e.g., http://www.cs.unc.edu/Research/MIDAG/pubs/papers/Joshi_TIP_2... for one method of constructing them.
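A much simpler (and not diffeomorphic) way to interpolate a vector field from a few anchor points is a Gaussian RBF kernel. This sketch shows the general "learn at fixed points, interpolate everywhere" idea only, not the construction from the linked paper:

```python
import numpy as np

def rbf_field(anchors, vectors, sigma=1.0):
    # Build a smooth vector field that exactly matches `vectors`
    # at `anchors` and blends between them elsewhere.
    A = np.asarray(anchors, dtype=float)
    V = np.asarray(vectors, dtype=float)
    K = np.exp(-np.sum((A[:, None] - A[None, :]) ** 2, axis=2) / (2 * sigma**2))
    coeffs = np.linalg.solve(K, V)   # interpolation weights
    def field(x):
        k = np.exp(-np.sum((A - x) ** 2, axis=1) / (2 * sigma**2))
        return k @ coeffs
    return field
```

Unlike a diffeomorphic flow, nothing here guarantees invertibility; it is only smoothness plus exact interpolation at the anchors.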
Off topic: I wish there were a practical way to program for neuromorphic chips, which are inherently analog. Does anybody have ideas or links regarding this field?
One advantage of SVMs is that, after re-mapping the data so that it becomes separable, they also guarantee to separate it as cleanly as possible (the maximum-margin hyperplane).
I'm not sure whether they can solve all the same types of problems, but in some cases they are considered a better, more analytic approach.
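On the circle-inside-an-annulus data from the article, this is exactly what the kernel trick does. A minimal sketch, assuming scikit-learn is available:

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# A small circle inside a larger ring: not linearly separable in 2D.
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

# The RBF kernel implicitly re-maps the data into a higher-dimensional
# space; the SVM then finds the maximum-margin hyperplane there.
clf = SVC(kernel="rbf").fit(X, y)
accuracy = clf.score(X, y)
```

The deformation here is fixed by the choice of kernel rather than learned layer by layer, which is the trade-off against a neural network.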