thisiszilff | 3 years ago
It's good to ignore self-attention for a moment and take a look at a convolutional network (a CNN). Why is a CNN more effective than just stacks of fully connected layers? Well, instead of just throwing data at the network and telling it to figure out what to do with it, we've built some prior knowledge into the network. We tell it "you know, a cup is going to be a cup even if it is 10 pixels up or 10 pixels down; even if it is in the upper right of the image or the lower left." We also tell it, "you know, the pixels near a given pixel are going to be pretty correlated with that pixel, much more so than pixels far away." Convolutions let us express that kind of knowledge in the form of a neural network.
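That translation prior can be made concrete: a convolution commutes with shifts, so a feature detected in one spot is detected the same way anywhere else. A minimal numpy sketch (the kernel and signal values are arbitrary illustration numbers, not from any real model):

```python
import numpy as np

rng = np.random.default_rng(0)
kernel = np.array([1.0, -2.0, 1.0])   # a tiny edge-like filter
signal = rng.normal(size=20)

out = np.convolve(signal, kernel, mode="valid")

# Shift the input right by 3 positions (pad with zeros on the left).
shifted = np.concatenate([np.zeros(3), signal])
out_shifted = np.convolve(shifted, kernel, mode="valid")

# The original response reappears, shifted by the same 3 positions:
# the network doesn't have to re-learn the feature at every location.
print(np.allclose(out_shifted[3:], out))  # True
```

A fully connected layer has no such guarantee; it would have to learn the shifted version of every feature from data.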
Self-attention plays a similar role. We are imbuing our network with an architecture that is aware of the data it is about to receive. We tell it "hey, elements in this sequence have a relation with one another, and that relative location might be important for the answer". Similar to convolutions, we also tell it that the location of various tokens in a sequence is going to vary: there shouldn't be much difference between "Today the dog went to the park" and "The dog went to the park today." Like convolutions, self-attention builds in certain assumptions we have about the data we are going to train the network on.
So yes, you are right that fully connected layers can emulate similar behavior, but training them to do that isn't easy. With self-attention, we've started with more prior knowledge about the problem at hand, so it is easier to solve.
lukah | 3 years ago
I can’t stand it when people lazily personify ML models, but it’s akin to giving someone with no experience some wood and then pointing to a shed and saying “make one of those from this”. Instead you’d expect them to be much more successful if you also give them a saw, a drill, some screws etc.
candiodari | 3 years ago
Well, a model that is aware of this symmetry only has half as much data to look at and one less thing to learn.
But that's only half of it. Truth is, assuming symmetries works pretty well even if the assumption is wrong. Why? Generalization. A model that has less to learn will generalize more ("better" is perhaps debatable, but it will definitely generalize more).
This is the basic idea behind "geometric deep learning". There are loads of papers, but here's a presentation.
https://www.youtube.com/watch?v=w6Pw4MOzMuo
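The "one less thing to learn" point can be shown directly: if you build a symmetry into the model by construction, it holds exactly from the start instead of being estimated from data. A toy sketch using flip symmetry (the `model` function and weights are made-up stand-ins, not any particular architecture):

```python
import numpy as np

rng = np.random.default_rng(2)

def model(x, W):
    # Stand-in for an arbitrary learned map (illustrative only).
    return np.tanh(x @ W).sum()

def symmetrized_model(x, W):
    # Average over the flip group {identity, reverse}. The output is
    # exactly invariant to flipping the input, by construction, so the
    # training data never has to teach the model this symmetry.
    return 0.5 * (model(x, W) + model(x[::-1], W))

W = rng.normal(size=(8, 3))
x = rng.normal(size=8)

print(np.isclose(symmetrized_model(x, W),
                 symmetrized_model(x[::-1], W)))  # True
```

Convolutions do the analogous thing for translations via weight sharing, which is exactly the lens geometric deep learning uses to unify these architectures.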