A conventional neural network, i.e. one built from a stack of dense layers, can't process a sequence the way a transformer does. It can compute the relative importance and interactions of the features it sees, but only for a fixed input size; it has no way to do that across arbitrary-length sequences without a mechanism that lets the sequence elements interact, which is exactly what self-attention provides.
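To make the point concrete, here's a minimal sketch of single-head self-attention in NumPy (no masking, no multi-head, weight shapes chosen arbitrarily for illustration). Note that the same fixed set of weight matrices works for a sequence of any length, because the interaction happens in the (seq_len x seq_len) score matrix, not in the parameters:

```python
import numpy as np

def self_attention(x, wq, wk, wv):
    """Single-head self-attention: every position attends to every position.

    x: (seq_len, d) input; seq_len can vary call to call,
    while wq/wk/wv stay the same fixed-size parameters.
    """
    q, k, v = x @ wq, x @ wk, x @ wv           # project to queries/keys/values
    scores = q @ k.T / np.sqrt(k.shape[-1])    # (seq_len, seq_len) pairwise interactions
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)         # softmax over positions
    return w @ v                               # mix values by attention weights

d = 8
rng = np.random.default_rng(0)
wq, wk, wv = (rng.normal(size=(d, d)) for _ in range(3))

# Same parameters handle sequences of different lengths:
for seq_len in (3, 7):
    out = self_attention(rng.normal(size=(seq_len, d)), wq, wk, wv)
    print(out.shape)
```

A dense stack, by contrast, bakes the sequence length into its first weight matrix, so it can only ever see inputs of one fixed size.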
inciampati|3 years ago