aconz2 | 1 year ago

To add on, since this took me a while to understand: for a single token, self-attention is permutation invariant with respect to the order of the tokens it attends over (ignoring positional encodings), because its output is a weighted sum of all the values, with the weights coming from that token's query dotted with every key (qK^T, then softmax). That sum is what gives the invariance, because + is commutative: reordering the key/value pairs reorders the terms but not the total. Across all the tokens, though, the MHA output matrix is not invariant but equivariant: permuting the input tokens permutes the rows of the output matrix in the same way. A more useful example might be to take one position, like the last one, and compute its MHA output for every permutation of the previous tokens; those should all be the same, up to floating-point error.
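A quick way to convince yourself is to check both properties numerically. This is only a minimal sketch under some assumptions of mine: a single head instead of full MHA, no positional encoding, no mask, random weights, and helper names (`matvec`, `attend`, `self_attention`, `close`) that are made up for illustration.

```python
import itertools
import math
import random


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def attend(query_tok, context, Wq, Wk, Wv):
    """Single-head attention output for one query token over `context`."""
    q = matvec(Wq, query_tok)
    keys = [matvec(Wk, t) for t in context]
    vals = [matvec(Wv, t) for t in context]
    scale = math.sqrt(len(q))
    scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
    top = max(scores)  # subtract the max for a numerically stable softmax
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    # Weighted sum of the values: the order of the terms cannot matter.
    return [
        sum(e / total * v[j] for e, v in zip(exps, vals))
        for j in range(len(vals[0]))
    ]


def self_attention(tokens, Wq, Wk, Wv):
    """One output row per token, each attending over the whole sequence."""
    return [attend(t, tokens, Wq, Wk, Wv) for t in tokens]


def close(a, b, tol=1e-9):
    """Elementwise comparison with a small floating-point tolerance."""
    return all(abs(x - y) <= tol for x, y in zip(a, b))


def rand_matrix(rng, rows, cols):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
d, n = 4, 5  # embedding size and sequence length (arbitrary choices)
Wq, Wk, Wv = (rand_matrix(rng, d, d) for _ in range(3))
tokens = rand_matrix(rng, n, d)

# Equivariance: permuting the input tokens permutes the output rows the same way.
perm = [2, 0, 4, 1, 3]
out = self_attention(tokens, Wq, Wk, Wv)
out_perm = self_attention([tokens[i] for i in perm], Wq, Wk, Wv)
equivariant = all(close(out_perm[j], out[perm[j]]) for j in range(n))

# Invariance: the last token's output is the same for every ordering of the
# tokens before it.
last, prefix = tokens[-1], tokens[:-1]
reference = attend(last, prefix + [last], Wq, Wk, Wv)
invariant = all(
    close(attend(last, list(p) + [last], Wq, Wk, Wv), reference)
    for p in itertools.permutations(prefix)
)

print("equivariant:", equivariant)
print("last position invariant:", invariant)
```

Both checks should come out True. Once you add positional encodings the inputs themselves change under a permutation, so neither check would be expected to hold as written.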
