top | item 34748592

craigacp | 3 years ago

A conventional neural network, i.e. one using a stack of dense layers, can't unroll across a sequence the way a transformer does. So while it could compute the relative importance and interaction of the features it sees, it couldn't compute that across arbitrary-length sequences without a mechanism for the sequence elements to interact, which is what self-attention provides.
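A minimal sketch of that point, using NumPy: a single self-attention head whose projection matrices (here random stand-ins for learned weights) apply to sequences of any length, because the pairwise interaction matrix is computed from the inputs themselves rather than baked into fixed-size weights.

```python
import numpy as np

# Hypothetical learned projections; random here purely for illustration.
rng = np.random.default_rng(0)
d_model = 8
W_q, W_k, W_v = (rng.standard_normal((d_model, d_model)) for _ in range(3))

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x):
    # x: (seq_len, d_model). The same weights work for any seq_len,
    # because the (seq_len, seq_len) interaction matrix is data-dependent.
    q, k, v = x @ W_q, x @ W_k, x @ W_v
    scores = softmax(q @ k.T / np.sqrt(d_model))
    return scores @ v

# Same parameters, different sequence lengths.
out3 = self_attention(rng.standard_normal((3, d_model)))  # shape (3, 8)
out7 = self_attention(rng.standard_normal((7, d_model)))  # shape (7, 8)
```

A stack of dense layers, by contrast, would need its input dimension fixed up front, so it couldn't accept both sequences with one set of weights.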

inciampati | 3 years ago

Practical attention implementations don't actually work over arbitrary-length sequences either. The universal approximation theorem still holds, IMO: information will mix as you go through fully connected MLP layers. Attention is apparently a prior structure that's needed to really bring training costs down.
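The "prior structure" point can be made concrete with a back-of-envelope parameter count: a fully connected layer over a flattened sequence ties its weight count to sequence length (so it must be retrained per length and grows quadratically), while attention's Q/K/V projections are length-independent. The dimensions below are illustrative assumptions, not from either comment.

```python
d_model = 64  # assumed embedding width, for illustration

def dense_params(seq_len):
    # One fully connected layer mapping the flattened sequence
    # (seq_len * d_model inputs) back to the same size.
    return (seq_len * d_model) ** 2

def attention_params():
    # Three projection matrices (Q, K, V), independent of seq_len.
    return 3 * d_model * d_model

print(dense_params(16))    # 1048576
print(dense_params(64))    # 16777216 -- grows with sequence length
print(attention_params())  # 12288 -- constant
```

An MLP can mix information in principle, but only within whatever fixed window its weights were sized for; the attention prior buys length-generalization and far fewer parameters to train.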