thisiszilff | 3 years ago
It's good to ignore self-attention for a moment and take a look at a convolutional network (a CNN). Why is a CNN more effective than just stacks of fully connected layers? Well, instead of just throwing data at the network and telling it to figure out what to do with it, we've built some prior knowledge into the network. We tell it "you know, a cup is going to be a cup even if it is 10 pixels up or 10 pixels down; even if it is in the upper right of the image or the lower left." We also tell it, "you know, the pixels near a given pixel are going to be pretty correlated with that pixel, much more so than pixels far away." Convolutions let us express that kind of knowledge in the form of a neural network.
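That translation prior can be made concrete: a convolution commutes with shifts, so a feature detected in one spot is detected the same way anywhere else. A minimal numpy sketch (the kernel and signal values are arbitrary illustration numbers, not from any real model):

```python
import numpy as np

rng = np.random.default_rng(0)
kernel = np.array([1.0, -2.0, 1.0])   # a tiny edge-like filter
signal = rng.normal(size=20)

out = np.convolve(signal, kernel, mode="valid")

# Shift the input right by 3 positions (pad with zeros on the left).
shifted = np.concatenate([np.zeros(3), signal])
out_shifted = np.convolve(shifted, kernel, mode="valid")

# The original response reappears, shifted by the same 3 positions:
# the network doesn't have to re-learn the feature at every location.
print(np.allclose(out_shifted[3:], out))  # True
```

A fully connected layer has no such guarantee; it would have to learn the shifted version of every feature from data.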
Self-attention plays a similar role. We are imbuing our network with an architecture that is aware of the data it is about to receive. We tell it "hey, elements in this sequence have a relation with one another, and that relative location might be important for the answer". Similar to convolutions, we also tell it that the location of various tokens in a sequence is going to vary: there shouldn't be much difference between "Today the dog went to the park" and "The dog went to the park today." Like convolutions, self-attention builds in certain assumptions we have about the data we are going to train the network on.
So yes, you are right that fully connected layers can emulate similar behavior, but training them to do that isn't easy. With self-attention, we've started with more prior knowledge about the problem at hand, so it is easier to solve.
lukah | 3 years ago
I can’t stand it when people lazily personify ML models, but it’s akin to giving someone with no experience some wood and then pointing to a shed and saying “make one of those from this”. Instead you’d expect them to be much more successful if you also give them a saw, a drill, some screws etc.
candiodari | 3 years ago
Well, a model that is aware of this symmetry only has half as much data to look at and one less thing to learn.
But that's only half of it. Truth is, assuming symmetries works pretty well even if the assumption is wrong. Why? Generalization. A model that has less to learn will generalize more ("better" is perhaps debatable, but it will definitely generalize more).
This is the basic idea behind "geometric deep learning". There are loads of papers, but here's a presentation.
https://www.youtube.com/watch?v=w6Pw4MOzMuo
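The "one less thing to learn" point can be shown directly: if you build a symmetry into the model by construction, it holds exactly from the start instead of being estimated from data. A toy sketch using flip symmetry (the `model` function and weights are made-up stand-ins, not any particular architecture):

```python
import numpy as np

rng = np.random.default_rng(2)

def model(x, W):
    # Stand-in for an arbitrary learned map (illustrative only).
    return np.tanh(x @ W).sum()

def symmetrized_model(x, W):
    # Average over the flip group {identity, reverse}. The output is
    # exactly invariant to flipping the input, by construction, so the
    # training data never has to teach the model this symmetry.
    return 0.5 * (model(x, W) + model(x[::-1], W))

W = rng.normal(size=(8, 3))
x = rng.normal(size=8)

print(np.isclose(symmetrized_model(x, W),
                 symmetrized_model(x[::-1], W)))  # True
```

Convolutions do the analogous thing for translations via weight sharing, which is exactly the lens geometric deep learning uses to unify these architectures.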