x1000 | 1 year ago
If we’re subtracting one attention matrix from another, we’d end up with attention scores between -1 and 1, with effectively zero probability that any single entry is exactly 0.
What’s more, the learnable parameter lambda is allowed to take negative values. That would let the model learn to actually add the two attention maps rather than subtract them, again making a score of exactly 0 impossible.
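A quick numeric sketch of the point above (the shapes, seed, and lambda value are arbitrary, chosen just for illustration):

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable row-wise softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
A1 = rng.normal(size=(4, 8))
A2 = rng.normal(size=(4, 8))

# difference of two softmax maps: every entry lies strictly in (-1, 1),
# and an entry landing exactly on 0 has measure zero
lam = 0.8
diff = softmax(A1) - lam * softmax(A2)

# with a negative lambda the two maps add instead of cancelling,
# so every entry is strictly positive
added = softmax(A1) - (-0.5) * softmax(A2)
```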
jszymborski | 1 year ago
- Rectify the difference of the softmaxes: max(0, s(A1) - lambda s(A2)).
- Apply the Heaviside function to the difference: s(A1) - lambda H(s(A1) - lambda s(A2)).
The second one is a bit more drastic and maybe harder to train.
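A minimal sketch of both proposals, implemented literally from the formulas above (function names, shapes, and the lambda value are mine; whether the Heaviside term is meant as written or as a multiplicative gate on s(A1) is not specified in the comment):

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable row-wise softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def rectified_diff_attn(A1, A2, lam):
    # proposal 1: rectify (ReLU) the difference of the softmaxes,
    # max(0, s(A1) - lam * s(A2)), so no entry can go negative
    return np.maximum(0.0, softmax(A1) - lam * softmax(A2))

def heaviside_diff_attn(A1, A2, lam):
    # proposal 2, taken literally: s(A1) - lam * H(s(A1) - lam * s(A2)),
    # where H is the Heaviside step (0 at 0 here)
    s1, s2 = softmax(A1), softmax(A2)
    return s1 - lam * np.heaviside(s1 - lam * s2, 0.0)

rng = np.random.default_rng(0)
A1 = rng.normal(size=(4, 8))
A2 = rng.normal(size=(4, 8))
rect = rectified_diff_attn(A1, A2, 0.8)
heav = heaviside_diff_attn(A1, A2, 0.8)
```

Note that the rectified variant is everywhere non-negative, while the Heaviside variant (as written) can still go negative wherever the step fires and lam exceeds the corresponding s(A1) entry, which is part of why it looks harder to train.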