
DroPE: Extending the Context of LLMs by Dropping Their Positional Embeddings

5 points | hardmaru | 1 month ago | pub.sakana.ai

1 comment


modeless|1 month ago

> While the original motivation for causal masking was not to provide positional information, but instead to have efficient parallelizable training, it turns out that a consistent <bos> token + causal masking is enough to perfectly reconstruct token positions.

I wish this point were explained further instead of being relegated to a footnote. It seems like the central insight that makes this technique work at all, and it isn't obvious to me, maybe because I haven't implemented a transformer from scratch.
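One intuition for the quoted claim (my own illustration, not from the paper): under causal masking, a query at position t can only attend to the t+1 tokens up to and including itself. A head that scores all visible tokens equally therefore gives each of them weight 1/(t+1). Since a consistent `<bos>` token is always the first visible token, the attention mass it receives is exactly 1/(t+1), which varies monotonically with position, so position is recoverable without any explicit positional embedding. A minimal NumPy sketch:

```python
import numpy as np

seq_len = 6
scores = np.zeros((seq_len, seq_len))            # a head with uniform raw scores
mask = np.triu(np.ones((seq_len, seq_len)), 1)   # causal mask: hide future tokens
scores[mask == 1] = -np.inf

# row-wise softmax (one row per query position)
weights = np.exp(scores)
weights /= weights.sum(axis=1, keepdims=True)

bos_weight = weights[:, 0]   # attention each position pays to <bos> at index 0
print(bos_weight)            # [1, 1/2, 1/3, 1/4, 1/5, 1/6] -- encodes position
```

This only shows that positional information is in principle linearly ordered and extractable from causal attention with a fixed first token; how the trained model actually exploits it is a separate question.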