With 1M tokens, if snapshotting the LLM state is cheap, it would beat nearly all RAG setups out of the box, except the ones dealing with large datasets. 1M tokens is a lot of docs.
Yeah, but latency is still a factor here. Any follow-up question requires re-scanning the whole context, which often takes a long time. IIRC, when Google showed their demos for this use case, each request took over a minute for ~650k tokens.
phillipcarter|2 years ago