spmurrayzzz | 7 months ago
I've come up with so many failed analogies at this point, I lost count (the concept of fast and slow clocks to represent the positional index / angular rotation has been the closest I've come so far).
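The fast/slow clocks analogy maps directly onto how RoPE actually works: each pair of embedding dimensions is rotated by an angle proportional to the token's position, with low-index pairs spinning fast (a "seconds hand") and high-index pairs spinning slowly (an "hours hand"). A minimal NumPy sketch (function names are my own, but the transform is the standard RoPE formulation):

```python
import numpy as np

def rope_angles(pos, dim, base=10000.0):
    # One frequency per dimension pair: low pairs spin fast ("seconds hand"),
    # high pairs spin slowly ("hours hand").
    freqs = base ** (-np.arange(0, dim, 2) / dim)
    return pos * freqs  # angle of each clock hand at this position

def apply_rope(x, pos, base=10000.0):
    # Rotate consecutive (even, odd) dimension pairs of x by the
    # position-dependent angles.
    angles = rope_angles(pos, x.shape[-1], base)
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x_even * cos - x_odd * sin
    out[..., 1::2] = x_even * sin + x_odd * cos
    return out

# The payoff: query/key dot products depend only on the *offset*
# between positions, not their absolute values.
q, k = np.random.randn(8), np.random.randn(8)
d1 = apply_rope(q, 5) @ apply_rope(k, 3)      # offset 2
d2 = apply_rope(q, 105) @ apply_rope(k, 103)  # offset 2, shifted by 100
print(np.allclose(d1, d2))  # True
```

The clock picture also explains why the trick works: composing two rotations only cares about the difference in their angles, so attention scores become a function of relative position.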
spmurrayzzz | 7 months ago
Long context is almost always some form of RoPE in practice (often YaRN these days). We can't confirm this for the closed-source frontier models, but given that every long-context model in the open-weight domain encodes positional data this way, and that the bulk of recent and past literature corroborates its use, we can be reasonably sure they're using some form of it as well.
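For intuition on the context-extension side: the simplest RoPE-stretching baseline is linear position interpolation, which just slows every "clock" down by a constant factor so longer sequences reuse the angle range seen in training (YaRN refines this by scaling frequency bands unevenly and adjusting attention temperature). A hedged sketch with illustrative lengths:

```python
import numpy as np

def rope_freqs(dim, base=10000.0, scale=1.0):
    # Standard RoPE frequencies; scale < 1 slows every "clock" down,
    # i.e. linear position interpolation for context extension.
    return scale * base ** (-np.arange(0, dim, 2) / dim)

# A model trained on 4k tokens, stretched to 16k (illustrative numbers):
train_len, target_len = 4096, 16384
scaled = rope_freqs(64, scale=train_len / target_len)

# The fastest clock at the new max position now sweeps roughly the same
# angle it reached at the old max position during training.
print(scaled[0] * (target_len - 1))          # ~4095.75
print(rope_freqs(64)[0] * (train_len - 1))   # 4095.0
```

The catch, and the motivation for YaRN, is that uniformly slowing the fast clocks also blurs the fine-grained local distinctions they encode, which is why band-aware scaling tends to work better than plain interpolation.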
EDIT: there is a recent paper that addresses the sequence-modeling problem in another way, but it's somewhat orthogonal to the above, as they change the tokenization method entirely: https://arxiv.org/abs/2507.07955