top | item 41777542

pxdm | 1 year ago

What's the comparison with conventional attention using a more aggressive (lower-temperature) softmax? I can imagine that for the multi-needle retrieval test this may also give a performance boost, although at some cost to other, more creative tasks.
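To make the "lower temperature" idea concrete, here is a minimal NumPy sketch (not from the paper) showing how dividing the logits by a temperature below 1 sharpens the attention distribution onto the top-scoring key:

```python
import numpy as np

def softmax(x, temperature=1.0):
    """Softmax over attention logits; lower temperature -> sharper distribution."""
    z = x / temperature
    z = z - z.max()          # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.5, 0.1])   # toy attention logits
p_normal = softmax(scores, temperature=1.0)
p_sharp = softmax(scores, temperature=0.5)  # more mass on the top score
```

With temperature 0.5 the top key's weight grows at the expense of the rest, which is the sharpening effect being compared to the subtraction trick.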

mota7 | 1 year ago

I had the same thought: Just eye-balling the graphs, the result of the subtraction looks very close to just reducing the temperature.

They're effectively doing softmax with a fixed temperature, but it's unclear that this work is going to do better than just learning a per-head temperature parameter.

c.f. https://arxiv.org/abs/2010.04245 which shows an improvement by learning per-head temperature.
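A per-head learned temperature amounts to one extra scalar parameter per head that rescales that head's logits. A minimal NumPy sketch of the forward pass (variable names are my own, and `log_temp` would be a trained parameter in a real model):

```python
import numpy as np

rng = np.random.default_rng(0)
n_heads, seq, d = 4, 8, 16

# One learnable log-temperature per head; exp() keeps the temperature positive.
log_temp = np.zeros(n_heads)  # trained jointly with the rest of the model

q = rng.normal(size=(n_heads, seq, d))
k = rng.normal(size=(n_heads, seq, d))

scores = q @ k.transpose(0, 2, 1) / np.sqrt(d)       # (heads, seq, seq)
scores = scores / np.exp(log_temp)[:, None, None]    # per-head temperature
scores = scores - scores.max(-1, keepdims=True)      # stability
weights = np.exp(scores)
weights = weights / weights.sum(-1, keepdims=True)   # row-wise softmax
```

The point of the comparison: if subtraction mostly acts like sharpening, this single learned scalar per head might capture the same benefit more cheaply.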

The other way to think about this is that it looks like a hacked-up, kinda-sorta gated attention. If that's the case, then doing softmax(alpha * q_1 k_1^T - log_sigmoid(beta * q_2 k_2^T)) might be better? (where alpha, beta are learned temperatures).
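The proposed variant can be sketched as follows. This is my own reading of the formula, with alpha and beta fixed to scalars for illustration rather than learned, and `log_sigmoid` written in a numerically stable form:

```python
import numpy as np

def log_sigmoid(x):
    # log(sigmoid(x)) = -log(1 + exp(-x)), computed stably via logaddexp
    return -np.logaddexp(0.0, -x)

rng = np.random.default_rng(0)
seq, d = 8, 16
q1, k1 = rng.normal(size=(seq, d)), rng.normal(size=(seq, d))
q2, k2 = rng.normal(size=(seq, d)), rng.normal(size=(seq, d))

alpha, beta = 1.0, 1.0  # would be learned temperatures in practice

# softmax(alpha * q1 k1^T - log_sigmoid(beta * q2 k2^T))
logits = alpha * (q1 @ k1.T) - log_sigmoid(beta * (q2 @ k2.T))
logits = logits - logits.max(-1, keepdims=True)
attn = np.exp(logits)
attn = attn / attn.sum(-1, keepdims=True)
```

Since log_sigmoid is always negative, the subtracted term adds a non-negative penalty-like contribution from the second query/key pair, which is what makes this read as a soft gate on the first pair's scores.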