IceHegel|6 months ago
There's a significant loss of model sharpness as context goes over 100K tokens. Sometimes it happens earlier, sometimes later. Even using context windows to their maximum extent today, the models aren't especially nuanced over long context. I compact after 100K tokens.
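For concreteness, the compaction strategy described above could look something like the sketch below. Everything here is illustrative: token counts are approximated by word count (a real implementation would use the model's tokenizer), and `summarize` is a stub standing in for an LLM call.

```python
# Sketch of "compact after 100K tokens": once the running token count
# crosses a threshold, replace older messages with a single summary
# and keep only the most recent messages verbatim.

COMPACT_THRESHOLD = 100_000

def approx_tokens(text: str) -> int:
    """Crude token estimate: ~1 token per word (illustration only)."""
    return len(text.split())

def summarize(messages: list[str]) -> str:
    """Placeholder summarizer; a real one would call an LLM."""
    return f"Summary of {len(messages)} earlier messages."

def compact_history(messages: list[str], keep_recent: int = 10) -> list[str]:
    """Collapse older messages into a summary when over the threshold."""
    total = sum(approx_tokens(m) for m in messages)
    if total <= COMPACT_THRESHOLD or len(messages) <= keep_recent:
        return messages
    old, recent = messages[:-keep_recent], messages[-keep_recent:]
    return [summarize(old)] + recent
```

One design note: compacting at a fixed threshold trades recall of old details for sharpness on recent context, which is exactly the trade-off being discussed in this thread.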
IceHegel|5 months ago
Because my understanding is that, however you get to 100K, the 100,001st token is generated the same way as far as the model is concerned.
IceHegel|5 months ago
If you give a summary+graph to the model, it can still only attend to the summary for token 1. If it's going to call a tool for a deeper memory, it still only gets the summary when it makes the decision on what to call.
You get the same problem when asking the model to make changes in even a medium-sized codebase. It starts from scratch each time, takes forever to read a bunch of files, and sometimes it reads the right stuff and other times it doesn't.