mota7 | 1 year ago
They're effectively doing softmax with a fixed temperature, but it's unclear that this will do better than just learning a per-head temperature parameter.
Cf. https://arxiv.org/abs/2010.04245, which shows an improvement from learning a per-head temperature.
The other way to think about this is that it looks like a hacked-up kinda-sorta gated attention. If that's the case, then doing softmax(alpha * q_1 k_1^T - log_sigmoid(beta * q_2 k_2^T)) might be better? (where alpha, beta are learned temperatures).
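A minimal numpy sketch of the variant proposed above, assuming single-head, unbatched attention and scalar alpha/beta (in practice these would be learned per-head parameters; the names and shapes here are illustrative, not from the linked work):

```python
import numpy as np

def softmax(x, axis=-1):
    # Shift by the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def log_sigmoid(x):
    # Numerically stable log(sigmoid(x)) = -log(1 + exp(-x)).
    return -np.logaddexp(0.0, -x)

def gated_temperature_attention(q1, k1, q2, k2, alpha, beta):
    # Hypothetical combined score from the comment:
    #   softmax(alpha * q_1 k_1^T - log_sigmoid(beta * q_2 k_2^T))
    # alpha and beta play the role of learned temperatures; since
    # log_sigmoid is <= 0, the subtracted term adds a non-negative
    # gate-like bonus to each attention logit.
    scores = alpha * (q1 @ k1.T) - log_sigmoid(beta * (q2 @ k2.T))
    return softmax(scores, axis=-1)

rng = np.random.default_rng(0)
d = 8
q1, k1, q2, k2 = (rng.standard_normal((4, d)) for _ in range(4))
weights = gated_temperature_attention(q1, k1, q2, k2, alpha=0.5, beta=1.0)
# Each row of `weights` is a valid attention distribution over the 4 keys.
```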