1) The article doesn't say this, but dimensions don't always have to do with locations in space and time -- you can treat any continuously varying value as a dimension. For example, a person might have dimensions for personality type, age, hair color, and so on. It seems like this technique could be used to train CNNs to recognize patterns in a lot of data besides imagery -- fraud detection based on credit card transactions, for example.
2) There are a lot of local and global symmetries in physics -- I wonder what new capabilities adding them to a CNN would enable?
The article is talking specifically about performing convolutions on higher-dimensional manifolds. This is different from the broader concept of data dimensionality typically associated with AI/ML.
Without repeating the article too much, this is important because it can be used to learn very complex systems from a series of lower-dimensional projections -- for example, building a 3D model of a dog from a collection of 2D images of dogs. The resulting system can better detect a dog in a position it has never seen, because the CNN has the relationship between 3D space and its 2D representation built in.
> 2) There are a lot of local and global symmetries in physics -- I wonder what new capabilities adding them to a CNN would enable?
It would be trivial to make any ML model satisfy dimensional homogeneity. Just use only dimensionless variables consistent with the Buckingham Pi theorem. Other symmetries would probably have to be baked in from the start.
As I recall some engineers are developing ML-like models that satisfy all sorts of physical constraints under the name "model order reduction".
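A minimal sketch of what "use only dimensionless variables" could look like in practice. The pipe-flow variables and the specific dimensionless groups (Reynolds and Euler numbers) are my own illustration, not something from the comment:

```python
import numpy as np

def to_dimensionless(velocity, length, kinematic_viscosity, pressure_drop, density):
    """Collapse five dimensional inputs into two dimensionless groups,
    per the Buckingham Pi theorem: 5 variables - 3 base dimensions
    (mass, length, time) = 2 independent groups."""
    reynolds = velocity * length / kinematic_viscosity
    euler = pressure_drop / (density * velocity**2)
    return np.array([reynolds, euler])

# A model trained on [Re, Eu] instead of the raw variables is automatically
# invariant to the choice of units: rescaling every input to a consistent
# alternative unit system leaves both groups unchanged.
features_si = to_dimensionless(2.0, 0.1, 1e-6, 500.0, 1000.0)
```

Any downstream model fed `features_si` then satisfies dimensional homogeneity for free, which is the point of the comment.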
"Now, researchers have delivered, with a new theoretical framework for building neural networks that can learn patterns on any kind of geometric surface. These “gauge-equivariant convolutional neural networks,” or gauge CNNs, developed at the University of Amsterdam and Qualcomm AI Research by Taco Cohen, Maurice Weiler, Berkay Kicanaoglu and Max Welling, can detect patterns not only in 2D arrays of pixels, but also on spheres and asymmetrically curved objects. “This framework is a fairly definitive answer to this problem of deep learning on curved surfaces,” Welling said."
Can someone explain to me why advances in actual model performance come from using analogies from physics when there are papers that supposedly provide a mathematical explanation of convolution?
"A Mathematical Theory of Deep Convolutional Neural Networks for Feature Extraction": https://arxiv.org/pdf/1512.06293.pdf

"Understanding Convolutional Neural Networks with A Mathematical Model": https://arxiv.org/pdf/1609.04112.pdf
Not really topology; it's more like group theory and representation theory.
Ordinary convnets are a way of building in translational symmetry, which is the group R^2 (in the plane). The work being described extends this to larger symmetry groups, such as rotations of a molecule in 3D (which is SO(3)).
For either of these, you can work in Fourier space instead of real space, where convolutions become products. For ordinary convnets this means the ordinary FFT, but nobody does that, as translating to neighbouring pixels is simple enough. Rotations aren't so simple, so working in Fourier space can be an efficient way to do things. And the connection to physics is really just that the representation theory of SO(3) is a bread-and-butter exercise there -- the basis of atomic theory.
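Both points above -- convolution builds in translational symmetry, and convolution becomes a product in Fourier space -- can be checked in a few lines. This is a 1D circular-convolution sketch of my own, not code from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(16)   # signal
k = rng.standard_normal(16)   # filter

def circ_conv(x, k):
    # Convolution theorem: circular convolution in real space is a
    # pointwise product in Fourier space.
    return np.fft.ifft(np.fft.fft(x) * np.fft.fft(k)).real

# Translation equivariance: shifting the input shifts the output by the
# same amount -- the symmetry that ordinary convnets build in. Gauge CNNs
# extend this construction to larger groups like SO(3).
shift = 5
lhs = circ_conv(np.roll(x, shift), k)
rhs = np.roll(circ_conv(x, k), shift)
print(np.allclose(lhs, rhs))  # True
```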
What happens if there are multiple 3D or 4D objects? Do we then need an attention mechanism as well? Or is there some topology where a "where vs what" pathway emerges naturally?
Is this very different from a graph convolutional network (GCN)? It seems like a GCN would have a lot of the same equivariance properties (i.e. orientation, units of measure, etc.).
_I_ would like to have a VR experience in higher dimensions. It should not be completely impossible to build some kind of actuators that I could attach to my body to sense my orientation and acceleration in the fourth dimension.
May I recommend 4D Toys? It's made by the same guy who's developing Miegakure and it has a VR version. The controls are a bit limited though in that user-initiated rotations are limited to 3D.
Something that could go by the same title is the use of tensor networks for ML. I think it works like a pre-optimization step by dimension reduction of the solution space, but if someone could give the right intuitive explanation I'd be much obliged.
It seems to be a way to lessen inductive bias by making decisions about available ML algos. That is, it vastly increases the solution space but remains effective by omitting unlikely solutions.
This is a really interesting application of differential geometry in machine learning! And the allusion at the end to having the system eventually learn the symmetries of the system and make use of them is intriguing. All the examples they gave were very physical, like climate models, but I imagine you could find symmetries in much more abstract problems that may not be intuitive.
One of the strangest things in AI, to me, is that you can average the weights of multiple different models to create a single model that’s better than any of the individuals.
This is how distributed training often works, for example. Data parallelism.
I don’t understand why it still works in higher dimensions, but it seems to.
The intuition is that the multiple models are “spinning” around the true solution, so averaging gives the final result more quickly. But it works even early in the training process.
When we use data parallelism, we're summing/averaging the gradients induced by different parts of the data, not the weights of the model itself. When using multiple models for ensemble methods, we're summing/averaging the output of the models for a sample. We're not summing/averaging the weights of the models themselves. While averaging the weights of several models might work on a given problem, it definitely doesn't work in general.
For example, flipping the sign of both weights in a one-unit tanh network leaves the function unchanged: w2 * tanh(w1 * x) = -w2 * tanh(-w1 * x).
But if you average the weights in those two equivalent models you get 0.
If you're talking about asynchronous data parallelism, then there can be some averaging of weights, but they all start with the same weights and are re-synched often enough that weights are never too different to break it.
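The sign-flip symmetry above is easy to verify numerically. A sketch of my own, using the same one-hidden-unit tanh network and arbitrary example weights:

```python
import numpy as np

def net(w1, w2, x):
    # Minimal one-hidden-unit network: w2 * tanh(w1 * x)
    return w2 * np.tanh(w1 * x)

w1, w2 = 1.5, -0.7
x = np.linspace(-3, 3, 50)

# Flipping the sign of both layers leaves the function unchanged...
print(np.allclose(net(w1, w2, x), net(-w1, -w2, x)))       # True

# ...but averaging the two equivalent weight vectors gives the zero function.
w1_avg, w2_avg = (w1 + -w1) / 2, (w2 + -w2) / 2
print(np.allclose(net(w1_avg, w2_avg, x), 0.0))            # True
```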
Concentration of measure. If you have a quantity that depends on a large number of random variables but not too strongly on any small subset of them, it tends to behave like a constant. That's the intuition behind the law of large numbers, the central limit theorem, a bunch of concentration inequalities, and model averaging.
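The concentration intuition can be sketched with the simplest case, the sample mean: it depends on many variables but only weakly on each one, so its spread shrinks roughly like 1/sqrt(n). A small illustration of my own:

```python
import numpy as np

rng = np.random.default_rng(0)

def spread_of_mean(n, trials=2000):
    # Standard deviation of the sample mean of n uniform[0,1]
    # variables, estimated across many independent trials.
    return rng.uniform(size=(trials, n)).mean(axis=1).std()

# Averaging over more variables -> the mean behaves more like a constant.
print(spread_of_mean(10) > spread_of_mean(1000))  # True
```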
http://jmlr.csail.mit.edu/papers/volume17/16-137/16-137.pdf

The math there is above my pay grade, but it describes a way to structurally combine a certain class of ML models for very significant performance gains. More importantly, it describes why it works.
> One of the strangest things in AI, to me, is that you can average the weights of multiple different models to create a single model that’s better than any of the individuals.
I think this is one of the least strange things in AI. All you're doing is taking N overfitted models (unlikely to be overfit in the same way) and then asserting that the average of those predictions is probably not overfitted as much (regularization). Overfitting as a concept is not restricted to some number of dimensions.
The weights aren't averaged unless the training step is synchronous. Even then, most of the time it's the gradients that are added up rather than the actual weights.

For inference, I don't think there are many papers claiming that a direct average of weights performs better than any single model. It's usually the output that is accumulated in some way.
As others commented, you either ensemble models (average the predictions) or average the update (gradient).
For ensembling, the mathematical justification for why this surprising result is true (e.g. just averaging many weak but different models gives a better model) is pretty interesting:
https://en.wikipedia.org/wiki/Condorcet%27s_jury_theorem
This article is all over the place, with almost no substance. It starts by talking about Einstein's theory of relativity, claims this definitively solves deep learning on curved data, and gives almost no insight into what is actually different. It is so bad it makes me want to avoid this site altogether.
I thought quantamagazine was above publishing click-bait, but I guess not.
Neural networks already see in "higher dimensions" (whatever that means). Anyone who's ever used neural networks knows that each component (i.e. "dendrite") of an N-sized input vector can already be thought of as a "dimension" of a data set. CNNs (convolutions) flatten that data (reduce it, or see the same pattern over fewer "dendrites"), much like PCA, etc.
CNNs only make sense when working with image data anyways.
> CNNs only make sense when working with image data anyways
Not true, N-dimensional convnets, 1-d convnets (for NLP and time series analysis), spatially sparse convnets, graph and non-Euclidean space convnets, ... exist and are used.
CNNs are akin to multiscale wavelet transforms. They can be applied on different spaces (just as graph wavelet transforms exist).
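On the GCN question above, the core idea of graph convolution fits in a few lines: average features over each node's neighbourhood, then apply a shared linear transform -- the graph analogue of sliding a filter over pixels. A toy sketch of my own using Kipf/Welling-style normalisation:

```python
import numpy as np

A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)      # adjacency of a 3-node path graph
A_hat = A + np.eye(3)                        # add self-loops
D_inv = np.diag(1.0 / A_hat.sum(axis=1))     # degree normalisation
X = np.array([[1.0], [2.0], [3.0]])          # one feature per node
W = np.array([[2.0]])                        # shared "filter" weight

# One propagation step: each node's new feature is the mean of its
# neighbourhood's features, times the shared weight.
H = D_inv @ A_hat @ X @ W
```

Because the same `W` is applied at every node, the layer is equivariant to relabelling the nodes, much as an image convolution is equivariant to translation.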
> CNNs only make sense when working with image data anyways.
Not true, CNNs are used for audio and text as well.
I don't think the title is clickbait; you may be misinterpreting it. It's referring to using CNNs on higher-dimensional inputs, not to the layers having multiple dimensions (which has been done since the creation of convnets).
> Can someone explain to me why advances in actual model performance come from using analogies from physics when there are papers that supposedly provide a mathematical explanation of convolution?

There's a difference between (a) studying an existing technique with math and (b) coming up with a new technique. You could get a modern engineering consultancy to review your steam engine, but it would still be a steam engine.
> This is how distributed training often works, for example. Data parallelism.

I always thought the preferred method was to average the gradient updates, and pass those on to update the single mother model.
Related: https://en.wikipedia.org/wiki/The_Wisdom_of_Crowds