shpongled | 6 months ago
I looked through their torch implementation and noticed that they are applying RoPE to both the query and key matrices in every layer of the transformer - is this standard? I thought positional encodings were usually just added once at the first layer.

m_ke | 6 months ago
No, they're usually done at each attention layer.

shpongled | 6 months ago
Do you know when this was introduced (or which paper)? AFAIK it's not that way in the original transformer paper, or BERT/GPT-2.
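[Editor's note: for readers following the thread, below is a minimal sketch of the pattern being discussed, assuming standard rotary position embeddings as introduced in RoFormer (Su et al., 2021) - not the specific implementation the commenter looked at. All function and variable names are illustrative.]

    import torch

    def rope_rotate(x, base=10000.0):
        # x: (batch, seq_len, n_heads, head_dim); head_dim must be even.
        b, s, h, d = x.shape
        # One frequency per channel pair, as in RoFormer (Su et al., 2021).
        inv_freq = base ** (-torch.arange(0, d, 2, dtype=torch.float32) / d)
        pos = torch.arange(s, dtype=torch.float32)
        ang = torch.einsum("s,f->sf", pos, inv_freq)  # (seq_len, head_dim/2)
        cos = ang.cos()[None, :, None, :]
        sin = ang.sin()[None, :, None, :]
        x1, x2 = x[..., 0::2], x[..., 1::2]
        # Rotate each (x1, x2) channel pair by a position-dependent angle.
        out = torch.empty_like(x)
        out[..., 0::2] = x1 * cos - x2 * sin
        out[..., 1::2] = x1 * sin + x2 * cos
        return out

    # Inside *every* attention layer: rotate q and k before the dot
    # product; v is left untouched, so position information enters the
    # layer only through the q.k attention scores.
    def attention_scores(q, k):
        q, k = rope_rotate(q), rope_rotate(k)
        return torch.einsum("bqhd,bkhd->bhqk", q, k) / q.shape[-1] ** 0.5

    q = torch.randn(2, 16, 8, 64)
    k = torch.randn(2, 16, 8, 64)
    scores = attention_scores(q, k)  # (2, 8, 16, 16)

Because the rotation angle depends only on absolute position, the rotated dot product q.k depends only on the relative offset between positions. That is also why it is re-applied inside each attention layer rather than added once to the input embeddings, as the sinusoidal encodings were in the original transformer.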