(no title)
pxdm
|
1 year ago
What's the comparison with conventional attention using a more aggressive (lower temperature) softmax? I can imagine that for the multi-needle retrieval test this may also give a performance boost, although at some cost other more creative tasks.
mota7|1 year ago
They're effectively doing softmax with a fixed temperature, but it's unclear that this work is going to do better than just learning a per-head temperature parameter.
c.f. https://arxiv.org/abs/2010.04245 which shows an improvement by learning per-head temperature.
The other way to think about this is that it looks like a hacked-up kinda-sorta gated attention. If that's the case, then doing softmax(alphaq_1k_1^T - log_sigmoid(betaq_2k_2^T)) might be better? (where alpha,beta are learned temperatures).