top | item 44801323

shpongled | 6 months ago

I looked through their torch implementation and noticed that they apply RoPE to both the query and key matrices in every layer of the transformer. Is this standard? I thought positional encodings were usually just added once, at the first layer.

m_ke | 6 months ago

No, they're usually applied at each attention layer.

shpongled | 6 months ago

Do you know when this was introduced (or in which paper)? AFAIK it's not done that way in the original transformer paper, or in BERT/GPT-2.
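
To illustrate the distinction the thread is discussing: a minimal NumPy sketch of rotary position embedding (function name and shapes are my own, not from the implementation being discussed). Instead of adding a positional vector to token embeddings once at the input, RoPE rotates each pair of dimensions in the query and key vectors by a position-dependent angle, inside every attention layer:

```python
import numpy as np

def rope(x, base=10000.0):
    """Apply rotary position embedding to x of shape (seq_len, head_dim).

    Each pair of dimensions is rotated by an angle proportional to the
    token's position, so the q.k dot product after RoPE depends only on
    the relative offset between positions, not their absolute values."""
    seq_len, head_dim = x.shape
    half = head_dim // 2
    freqs = base ** (-np.arange(half) / half)      # per-pair rotation frequencies
    angles = np.outer(np.arange(seq_len), freqs)   # (seq_len, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * cos - x2 * sin,
                           x1 * sin + x2 * cos], axis=-1)

# In every attention layer, before computing attention scores:
#   q, k = rope(q), rope(k)
# (unlike sinusoidal encodings, which are added to token embeddings once)
```

Because it is a pure rotation, it preserves vector norms, and the score between two rotated vectors is a function of their position difference, which is why it has to be applied at each layer rather than baked into the embeddings once.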