
An idea from physics helps AI see in higher dimensions

230 points | theafh | 6 years ago | quantamagazine.org

75 comments

[+] empath75|6 years ago|reply
Pretty amazing work -- a couple of thoughts:

1) The article doesn't say this, but dimensions don't always have to do with locations in space and time; you can treat any continuously varying value as a dimension. For example, a person might have dimensions for personality type, age, hair color, and so on. It seems like this technique could be used to train CNNs to recognize patterns in lots of data besides imagery -- fraud detection based on credit card transactions, for example.

2) There are a lot of local and global symmetries in physics -- I wonder what new capabilities adding them to a CNN would enable?

[+] mywittyname|6 years ago|reply
The article is talking specifically about performing convolutions on higher-dimensional manifolds. This is different from the broader concept of data dimensionality typically associated with AI/ML.

Without repeating the article too much, this is important because it can be used to learn very complex systems from a series of lower-dimensional projections. For example, creating a 3D map of a dog from a collection of 2D images of dogs. The resulting system can better detect a dog in a position it's never seen, because the CNN has the relationship between 3D space and its 2D representations built into it.

[+] btrettel|6 years ago|reply
> 2) There are a lot of local and global symmetries in physics -- I wonder what new capabilities adding them to a CNN would enable?

It would be trivial to make any ML model satisfy dimensional homogeneity. Just use only dimensionless variables consistent with the Buckingham Pi theorem. Other symmetries would probably have to be baked in from the start.
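A minimal sketch of the dimensionless-variables idea (hypothetical pipe-flow quantities; `to_pi_groups` is an illustrative helper, not from any library):

```python
import numpy as np

# Hypothetical sketch (not from the article): instead of feeding raw
# dimensional quantities to a model, feed the dimensionless Buckingham Pi
# groups. Example variables for pipe flow: velocity U, diameter D,
# length L, kinematic viscosity nu, pressure drop dP, density rho.

def to_pi_groups(U, D, L, nu, dP, rho):
    """Reduce six dimensional inputs to three dimensionless groups."""
    Re = U * D / nu              # Reynolds number
    aspect = L / D               # geometric ratio
    Eu = dP / (rho * U**2)       # Euler number (dimensionless pressure drop)
    return np.array([Re, aspect, Eu])

# The same physical state expressed in SI units and in cm-based units
# yields identical features, so any model trained on them is automatically
# invariant to the choice of units.
si = to_pi_groups(2.0, 0.1, 5.0, 1e-6, 4.0e3, 1000.0)
cm = to_pi_groups(200.0, 10.0, 500.0, 0.01, 40.0, 1e-3)
assert np.allclose(si, cm)
```

Dimensional homogeneity falls out for free; as the comment notes, other symmetries would need to be built into the architecture itself.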

As I recall, some engineers are developing ML-like models that satisfy all sorts of physical constraints under the name "model order reduction".

[+] peter_d_sherman|6 years ago|reply
Excerpt:

"Now, researchers have delivered, with a new theoretical framework for building neural networks that can learn patterns on any kind of geometric surface. These “gauge-equivariant convolutional neural networks,” or gauge CNNs, developed at the University of Amsterdam and Qualcomm AI Research by Taco Cohen, Maurice Weiler, Berkay Kicanaoglu and Max Welling, can detect patterns not only in 2D arrays of pixels, but also on spheres and asymmetrically curved objects. “This framework is a fairly definitive answer to this problem of deep learning on curved surfaces,” Welling said."

[+] gambler|6 years ago|reply
Can someone explain to me why advances in actual model performance come from using analogies from physics when there are papers that supposedly provide a mathematical explanation of convolution?

"A Mathematical Theory of Deep ConvolutionalNeural Networks for Feature Extraction":

https://arxiv.org/pdf/1512.06293.pdf

"Understanding Convolutional Neural Networks with A Mathematical Model":

https://arxiv.org/pdf/1609.04112.pdf

[+] conjectures|6 years ago|reply
Because it's not a standard convolutional net by the description. The difference is:

A) Studying an existing technique with math.

B) Coming up with a new technique.

You could get a modern engineering consultancy to review your steam engine, but it would still be a steam engine.

[+] hanniabu|6 years ago|reply
Is multidimensional AI the new 2020 buzzword?
[+] z3c0|6 years ago|reply
I sure hope so. Telling people I build multidimensional data structures for a living has only yielded glazed-over eyes thus far.
[+] etaioinshrdlu|6 years ago|reply
This is super cool and I'm pretty sure this is basically topology. The article was pretty hard to read though. It reminds me a little bit of the https://en.m.wikipedia.org/wiki/Hairy_ball_theorem
[+] improbable22|6 years ago|reply
Not really topology, it's more like group theory, and representations.

Ordinary convnets are a way of building in translational symmetry, which is the group R^2 (in the plane). The work being described extends this to larger symmetry groups, such as rotations of a molecule in 3D (which is SO(3)).

For either of these, you can work in Fourier space instead of real space, where convolutions become products. For ordinary convnets this means the ordinary FFT, but nobody does that, as translating to neighbouring pixels is simple enough. Rotations aren't so simple, and so working in Fourier space can be an efficient way to do things. And the connection to physics is really just that the representation theory of SO(3) is a bread-and-butter exercise there, the basis of atomic theory.
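The "convolutions become products in Fourier space" point can be checked numerically in the simplest case, circular translations of a 1D signal (a hypothetical sketch; the SO(3) analogue would use spherical-harmonic transforms instead of the FFT):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 64
signal = rng.standard_normal(n)
kernel = rng.standard_normal(n)

# Direct circular convolution: out[i] = sum_k signal[k] * kernel[(i - k) mod n]
direct = np.array([
    sum(signal[k] * kernel[(i - k) % n] for k in range(n))
    for i in range(n)
])

# The same convolution as a pointwise product in Fourier space.
via_fft = np.fft.ifft(np.fft.fft(signal) * np.fft.fft(kernel)).real

assert np.allclose(direct, via_fft)
```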

[+] MrQuincle|6 years ago|reply
What happens if there are multiple 3D or 4D objects? Do we then need an attention mechanism as well? Or is there some topology where a "where vs what" pathway emerges naturally?
[+] numlock86|6 years ago|reply
Isn't a 3D object basically a 4D object's surface?
[+] _wzsf|6 years ago|reply
The only valid answer to your question is "probably"
[+] jhisiow9839|6 years ago|reply
Is this very different from a graph convolutional network (GCN)? It seems like a GCN would have a lot of the same equivariance properties (i.e. orientation, units of measure, etc.).
[+] beefield|6 years ago|reply
_I_ would like to have a VR experience in higher dimensions. It should not be completely impossible to build some kind of actuators that I can somehow attach to my body to sense my orientation and acceleration in fourth dimension.
[+] uj8efdkjfdshf|6 years ago|reply
May I recommend 4D Toys? It's made by the same guy who's developing Miegakure and it has a VR version. The controls are a bit limited though in that user-initiated rotations are limited to 3D.
[+] ganzuul|6 years ago|reply
Something that could go by the same title is the use of tensor networks for ML. I think it works like a pre-optimization step by dimension reduction of the solution space, but if someone could give the right intuitive explanation I'd be much obliged.

It seems to be a way to lessen inductive bias by making decisions about available ML algos. That is, it vastly increases the solution space but remains effective by omitting unlikely solutions.

[+] openasocket|6 years ago|reply
This is a really interesting application of differential geometry in machine learning! And the allusion at the end to having the system eventually learn the symmetries of the system and make use of that is intriguing. All the examples they gave were very physical, like climate models, but I imagine you could find symmetries in much more abstract problems that may not be intuitive.
[+] cosmic_ape|6 years ago|reply
Hype pipeline: 1) Take a banal feature engineering work. 2) Add Albert Einstein reference. 3) Profit.
[+] sillysaurusx|6 years ago|reply
One of the strangest things in AI, to me, is that you can average the weights of multiple different models to create a single model that’s better than any of the individuals.

This is how distributed training often works, for example. Data parallelism.

I don’t understand why it still works in higher dimensions, but it seems to.

The intuition is that the multiple models are “spinning” around the true solution, so averaging gives the final result more quickly. But it works even early in the training process.

[+] rprenger|6 years ago|reply
When we use data parallelism, we're summing/averaging the gradients induced by different parts of the data, not the weights of the model itself. When using multiple models for ensemble methods, we're summing/averaging the output of the models for a sample. We're not summing/averaging the weights of the models themselves. While averaging the weights of several models might work on a given problem, it definitely doesn't work in general.

For example, since tanh is odd:

w2 * tanh(w1 * x) = (-w2) * tanh((-w1) * x)

But if you average the weights in those two equivalent models you get 0.
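That counterexample is easy to verify numerically (a sketch with arbitrary weight values):

```python
import numpy as np

w1, w2 = 1.5, 2.0
x = np.linspace(-3, 3, 7)

net_a = w2 * np.tanh(w1 * x)           # model A
net_b = (-w2) * np.tanh((-w1) * x)     # model B: sign-flipped weights
assert np.allclose(net_a, net_b)       # identical functions of x

# Averaging the weights of these two equivalent models collapses
# both to the zero function.
w1_avg, w2_avg = (w1 + (-w1)) / 2, (w2 + (-w2)) / 2
averaged = w2_avg * np.tanh(w1_avg * x)
assert np.allclose(averaged, 0.0)
```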

If you're talking about asynchronous data parallelism, then there can be some averaging of weights, but they all start with the same weights and are re-synched often enough that weights are never too different to break it.

[+] joker3|6 years ago|reply
Concentration of measure. If you have a quantity that depends on a large number of random variables but not too strongly on any small subset of them, it tends to behave like a constant. That's the intuition behind the law of large numbers, the central limit theorem, a bunch of concentration inequalities, and model averaging.
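A toy numerical version of that intuition (a sketch in which Gaussian noise stands in for per-model error):

```python
import numpy as np

# The mean of n independent draws has standard deviation sigma / sqrt(n):
# averaging many noisy quantities concentrates them around a constant.
rng = np.random.default_rng(42)
sigma, n_models, n_trials = 1.0, 400, 2000

draws = rng.normal(0.0, sigma, size=(n_trials, n_models))
single = draws[:, 0]            # one noisy "model"
averaged = draws.mean(axis=1)   # ensemble average of 400 "models"

print(single.std())             # ≈ 1.0
print(averaged.std())           # ≈ 0.05, i.e. 1 / sqrt(400)
```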
[+] gambler|6 years ago|reply
Are you familiar with a paper called Synergy of Monotonic Rules?

http://jmlr.csail.mit.edu/papers/volume17/16-137/16-137.pdf

The math there is above my paygrade, but it describes a way to structurally combine a certain class of ML models for very significant performance gains. More importantly, it describes why it works.

[+] uoaei|6 years ago|reply
You are averaging weights in distributed training? That seems like it would be rife with pitfalls unless you average after every batch.

I always thought the preferred method was to average the gradient updates, and pass that to update the single mother-model.

[+] im3w1l|6 years ago|reply
I believe it works in distributed training because the models never have time to diverge far enough to be "incompatible".
[+] MiroF|6 years ago|reply
Even stranger is that such a misleading/false claim is upvoted here.
[+] anonytrary|6 years ago|reply
> One of the strangest things in AI, to me, is that you can average the weights of multiple different models to create a single model that’s better than any of the individuals.

I think this is one of the least strange things in AI. All you're doing is taking N overfitted models (unlikely to be overfit in the same way) and then asserting that the average of those predictions is probably not overfitted as much (regularization). Overfitting as a concept is not restricted to some number of dimensions.
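That intuition can be sketched with a toy bagging experiment (hypothetical setup: degree-7 polynomials deliberately overfit a noisy sine, and their predictions are averaged):

```python
import numpy as np

rng = np.random.default_rng(0)
x_train = np.linspace(0, 1, 30)
y_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.3, 30)
x_test = np.linspace(0, 1, 200)
y_test = np.sin(2 * np.pi * x_test)

# Fit many deliberately overfit models on bootstrap resamples,
# then average their *predictions*.
preds = []
for _ in range(50):
    idx = rng.integers(0, 30, 30)
    coeffs = np.polyfit(x_train[idx], y_train[idx], 7)
    preds.append(np.polyval(coeffs, x_test))
preds = np.array(preds)

mse_each = ((preds - y_test) ** 2).mean(axis=1)
mse_avg = ((preds.mean(axis=0) - y_test) ** 2).mean()

# By the bias-variance decomposition (Jensen's inequality), the averaged
# predictor's squared error can never exceed the average of the
# individual errors.
assert mse_avg <= mse_each.mean()
```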

[+] pavanky|6 years ago|reply
The weights aren't averaged unless the training step is synchronous. Even then, most of the time it is the gradients that are added up rather than the actual weights.

For inference, I don't think there are many papers claiming that a direct average of weights performs better than any single model. It is usually the output that is accumulated in some way.

[+] halflings|6 years ago|reply
As others commented, you either ensemble models (average the predictions) or average the update (gradient).

For ensembling, the mathematical justification for why this surprising result holds (i.e. why averaging many weak but different models gives a better model) is pretty interesting: https://en.wikipedia.org/wiki/Condorcet%27s_jury_theorem
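A quick numerical illustration of the jury theorem (hypothetical numbers: each independent voter/model is correct with probability 0.6):

```python
import numpy as np

rng = np.random.default_rng(1)
p, trials = 0.6, 20000

def majority_accuracy(n_voters):
    """Fraction of trials in which a strict majority votes correctly."""
    votes = rng.random((trials, n_voters)) < p   # True = correct vote
    return (votes.sum(axis=1) > n_voters / 2).mean()

print(majority_accuracy(1))    # ≈ 0.6, a single weak model
print(majority_accuracy(101))  # much higher, approaching 1
```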

[+] ncraig|6 years ago|reply
Works for humans as well (wisdom of crowds).
[+] BubRoss|6 years ago|reply
This article is all over the place, with almost no substance. It starts by talking about Einstein's theory of relativity, claims this definitively solves learning on curved surfaces, and gives almost no insight into what is actually different. It is so bad it makes me want to avoid this site altogether.
[+] gojomo|6 years ago|reply
The article includes multiple links to the related underlying research papers, for people who need more substance.
[+] eli_gottlieb|6 years ago|reply
If you think gauge invariance on Riemannian manifolds is an empty topic with no substance, you might not want to be working in machine learning.
[+] cliqueiq|6 years ago|reply
I thought quantamagazine was above publishing click-bait, but I guess not.

Neural networks already see in "higher dimensions" (whatever that means). Anyone who's ever used neural networks knows that each neuron's input branch (i.e. dendrite) in an N-sized vector can already be thought of as a "dimension" of a data set. CNNs (convolutions) flatten that data (reducing it, or seeing the same pattern over fewer "dendrites", much like PCA, etc.).

CNNs only make sense when working with image data anyways.

[+] jhj|6 years ago|reply
> CNNs only make sense when working with image data anyways

Not true: N-dimensional convnets, 1D convnets (for NLP and time-series analysis), spatially sparse convnets, graph and non-Euclidean-space convnets, and more all exist and are used.

CNNs are akin to multiscale wavelet transforms. They can be applied on different spaces (just as graph wavelet transforms exist).
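A minimal sketch of the 1D case, with a hand-picked finite-difference filter standing in for a learned one:

```python
import numpy as np

# A 1D convolution is just a filter slid along a sequence. Here a
# hand-picked edge detector finds a step change in a time series.
signal = np.concatenate([np.zeros(50), np.ones(50)])  # step up at t = 50
kernel = np.array([-1.0, 0.0, 1.0])                   # finite-difference filter

# np.correlate slides the kernel without flipping it, as CNN layers do.
response = np.correlate(signal, kernel, mode="valid")

print(int(response.argmax()))   # → 48: the step, offset by the kernel radius
```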

[+] ipsum2|6 years ago|reply
> CNNs only make sense when working with image data anyways.

Not true: CNNs are used for audio and text as well.

I don't think the title is clickbait; you may be misinterpreting it. It's referring to using CNNs on higher-dimensional inputs, not to a layer having multiple dimensions (which has been done since the creation of convnets).

[+] throwaway_tech|6 years ago|reply
Wow this is actually fairly close to my prediction for the HN Next Decade Prediction post.

Edit: I should say it is a big step in the direction of my prediction.

[+] gojomo|6 years ago|reply
What prediction, what post?