
spyder | 5 months ago

This is false:

"that's because a next token predictor can't "forget" context. That's just not how it works."

An LSTM is also a next-token predictor and literally has a forget gate, and there are many other context-compressing models that remember only what they think is important and forget the rest, for example state-space models or RWKV, which also work well as LLMs. Even the basic GPT model forgets old context, since it gets truncated when it can't fit, but that's not the learned, selective forgetting the other models do.
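To make the forget-gate point concrete, here is a minimal NumPy sketch of one LSTM cell update (weights are random toy values, not a trained model): the forget gate `f` is a sigmoid output in (0, 1) that multiplicatively scales down the previous cell state, which is exactly the learned "forgetting" mechanism.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
hidden, inp = 4, 3

# Toy weights for a single LSTM step (random, for illustration only).
W_f = rng.normal(size=(hidden, hidden + inp))  # forget gate weights
W_i = rng.normal(size=(hidden, hidden + inp))  # input gate weights
W_c = rng.normal(size=(hidden, hidden + inp))  # candidate memory weights
b_f = np.zeros(hidden)
b_i = np.zeros(hidden)
b_c = np.zeros(hidden)

def lstm_cell_update(c_prev, h_prev, x):
    """One step of the LSTM cell-state update (output gate omitted)."""
    z = np.concatenate([h_prev, x])
    f = sigmoid(W_f @ z + b_f)        # forget gate: near 0 = erase, near 1 = keep
    i = sigmoid(W_i @ z + b_i)        # input gate: how much new info to write
    c_tilde = np.tanh(W_c @ z + b_c)  # candidate new memory
    return f * c_prev + i * c_tilde   # old memory is scaled down by f

c_prev = np.ones(hidden)              # pretend the cell currently remembers something
h_prev = np.zeros(hidden)
x = rng.normal(size=inp)
c_next = lstm_cell_update(c_prev, h_prev, x)
```

In a trained network those gate weights are learned, so the model decides per-dimension and per-step how much of the old context to throw away.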
