mota7 | 1 year ago
They're effectively doing softmax with a fixed temperature, but it's unclear that this will do better than just learning a per-head temperature parameter.
Cf. https://arxiv.org/abs/2010.04245, which shows an improvement from learning a per-head temperature.
The other way to think about this is that it looks like a hacked-up kinda-sorta gated attention. If that's the case, then doing softmax(alpha * q_1 k_1^T - log_sigmoid(beta * q_2 k_2^T)) might be better? (where alpha, beta are learned temperatures).
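A minimal numpy sketch of the variant proposed above, assuming single-head, unbatched attention and scalar alpha/beta (in practice these would be learned per-head parameters; the names and shapes here are illustrative, not from the linked work):

```python
import numpy as np

def softmax(x, axis=-1):
    # Shift by the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def log_sigmoid(x):
    # Numerically stable log(sigmoid(x)) = -log(1 + exp(-x)).
    return -np.logaddexp(0.0, -x)

def gated_temperature_attention(q1, k1, q2, k2, alpha, beta):
    # Hypothetical combined score from the comment:
    #   softmax(alpha * q_1 k_1^T - log_sigmoid(beta * q_2 k_2^T))
    # alpha and beta play the role of learned temperatures; since
    # log_sigmoid is <= 0, the subtracted term adds a non-negative
    # gate-like bonus to each attention logit.
    scores = alpha * (q1 @ k1.T) - log_sigmoid(beta * (q2 @ k2.T))
    return softmax(scores, axis=-1)

rng = np.random.default_rng(0)
d = 8
q1, k1, q2, k2 = (rng.standard_normal((4, d)) for _ in range(4))
weights = gated_temperature_attention(q1, k1, q2, k2, alpha=0.5, beta=1.0)
# Each row of `weights` is a valid attention distribution over the 4 keys.
```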