machinelearning | 1 year ago
Both waste compute: you have to re-encode everything as text each time, and RAG needs a lot of heuristics plus a separate embedding model.
Instead, it makes a lot more sense to pre-compute the KV cache for each document once, then compute attention scores for each query, surfacing values only when the attention score is high enough.
The challenge here is encoding global position information into the surfaced values and getting them to work during generation. I suspect it won't work out of the box, but it should with training.
This approach has echoes of both infinite context length and RAG, but it is an intermediate method that can be parallelized and is more efficient than either one.
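A minimal sketch of the idea above, assuming single-head dot-product attention and a simple score threshold (all dimensions, weights, and the `surface_values` helper are hypothetical, just to illustrate the offline/online split):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64  # hypothetical head dimension

# Offline: pre-compute keys/values for each document's tokens once.
doc_tokens = rng.normal(size=(3, 10, d))  # 3 docs, 10 tokens each
W_k = rng.normal(size=(d, d)) / np.sqrt(d)
W_v = rng.normal(size=(d, d)) / np.sqrt(d)
K = doc_tokens @ W_k  # cached keys,   shape (docs, tokens, d)
V = doc_tokens @ W_v  # cached values, shape (docs, tokens, d)

# Online: per query, score against the cached keys and surface only
# the values whose attention score clears the threshold.
def surface_values(query, K, V, threshold=0.0):
    scores = (K @ query) / np.sqrt(d)  # (docs, tokens)
    mask = scores > threshold
    return V[mask], scores[mask]       # sparse subset of cached values

query = rng.normal(size=(d,))
vals, scores = surface_values(query, K, V)
print(vals.shape[0], "values surfaced out of", K.shape[0] * K.shape[1])
```

The expensive per-document encoding happens once; each query only pays for a score against cached keys, which is what makes the scheme parallelizable across documents. As noted above, the surfaced values carry no global position information here, which is the part that would need training to fix.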
Prosammer | 1 year ago
machinelearning | 1 year ago
Do you have a link? Or maybe you misunderstood what I was talking about.