x1000 | 1 year ago
If we’re subtracting one attention matrix from another, we’d end up with attention scores between -1 and 1, with effectively zero probability that any single entry is exactly 0.
What’s more, the learnable parameter lambda is allowed to take negative values. That would let the model learn to actually add the two attention maps rather than subtract them, again making a score of exactly 0 impossible.
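A quick numeric sketch of the point above (the shapes, seed, and lambda value are arbitrary, chosen just for illustration):

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable row-wise softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
A1 = rng.normal(size=(4, 8))
A2 = rng.normal(size=(4, 8))

# difference of two softmax maps: every entry lies strictly in (-1, 1),
# and an entry landing exactly on 0 has measure zero
lam = 0.8
diff = softmax(A1) - lam * softmax(A2)

# with a negative lambda the two maps add instead of cancelling,
# so every entry is strictly positive
added = softmax(A1) - (-0.5) * softmax(A2)
```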
jszymborski | 1 year ago
- Rectify the difference of the softmaxes: max(0, s(A1) - lambda s(A2)).
- Apply the Heaviside function to the difference: s(A1) - lambda H(s(A1) - lambda s(A2)).
The second one is a bit more drastic and maybe harder to train.
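A minimal sketch of both proposals, implemented literally from the formulas above (function names, shapes, and the lambda value are mine; whether the Heaviside term is meant as written or as a multiplicative gate on s(A1) is not specified in the comment):

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable row-wise softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def rectified_diff_attn(A1, A2, lam):
    # proposal 1: rectify (ReLU) the difference of the softmaxes,
    # max(0, s(A1) - lam * s(A2)), so no entry can go negative
    return np.maximum(0.0, softmax(A1) - lam * softmax(A2))

def heaviside_diff_attn(A1, A2, lam):
    # proposal 2, taken literally: s(A1) - lam * H(s(A1) - lam * s(A2)),
    # where H is the Heaviside step (0 at 0 here)
    s1, s2 = softmax(A1), softmax(A2)
    return s1 - lam * np.heaviside(s1 - lam * s2, 0.0)

rng = np.random.default_rng(0)
A1 = rng.normal(size=(4, 8))
A2 = rng.normal(size=(4, 8))
rect = rectified_diff_attn(A1, A2, 0.8)
heav = heaviside_diff_attn(A1, A2, 0.8)
```

Note that the rectified variant is everywhere non-negative, while the Heaviside variant (as written) can still go negative wherever the step fires and lam exceeds the corresponding s(A1) entry, which is part of why it looks harder to train.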