Prosammer | 1 year ago

Sorry for the late response. I must be misunderstanding your comment. I read it as "RAG doesn't pre-compute KV for each document, which is inefficient". With RAG, you convert your text into vectors and store them in a DB — this is the pre-compute step. At query time you only need to compute the vector of your query and search by vector similarity. So it seems to me that RAG doesn't suffer from the inefficiency you said it suffers from.
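To make the pre-compute claim concrete, here's a toy sketch of the retrieval side: document vectors are computed once, and each query only needs its own embedding plus a similarity search. The `embed()` below is a toy character-frequency stand-in for a real embedding model, and the names are mine, not from any particular library.

```python
import math

def embed(text: str) -> list[float]:
    # Toy embedding: normalized character-frequency vector over a-z.
    # A real system would call an embedding model here.
    vec = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def cosine(a: list[float], b: list[float]) -> float:
    # Vectors are already unit-normalized, so the dot product is cosine.
    return sum(x * y for x, y in zip(a, b))

# Pre-compute: embed every document once and store the vectors.
docs = [
    "the cat sat on the mat",
    "stock prices fell sharply",
    "dogs chase cats",
]
index = [(doc, embed(doc)) for doc in docs]

# Query time: embed only the query, then rank stored vectors by similarity.
def retrieve(query: str, k: int = 1) -> list[str]:
    q = embed(query)
    ranked = sorted(index, key=lambda pair: cosine(q, pair[1]), reverse=True)
    return [doc for doc, _ in ranked[:k]]
```

Note that what gets stored is the embedding vector per document, nothing about the LLM's internal state — which is exactly where the reply below picks up.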

machinelearning | 1 year ago

No, you've only discussed the Retrieval part of RAG, not the generation part.

The current workflow is to use the embedding to retrieve documents, then dump the text corresponding to the embedding into the LLM context for generation.

Often, the embedding comes from a different model than the LLM, so it is not compatible with the generation step.

So yea, RAG does not pre-compute the KV for each document.
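A conceptual sketch of that generation side, illustrating the point: the retrieved *text* goes back through the LLM on every query, so the attention keys/values (KV) for that document are recomputed each time. `llm_prefill` is a mock standing in for a real transformer forward pass; all names here are illustrative, not any real API.

```python
def llm_prefill(tokens: list[str]) -> dict:
    # Mock: a real model would run attention over all tokens here,
    # producing per-layer key/value tensors. Cost grows with len(tokens).
    return {"kv_len": len(tokens), "recomputed": True}

def rag_answer(query: str, retrieved_doc: str) -> dict:
    # Standard RAG: concatenate retrieved text + query, then prefill.
    # The document's KV cache is rebuilt from scratch on every call,
    # because the vector index stores embeddings, not KV tensors --
    # and a KV cache would only be reusable by the exact same LLM.
    prompt_tokens = (retrieved_doc + " " + query).split()
    return llm_prefill(prompt_tokens)

cache = rag_answer("what did the cat do?", "the cat sat on the mat")
# The entire retrieved document was re-processed for this one query.
```

Pre-computing and caching per-document KV is possible in principle, but it has to be produced by the same model (and cache layout) used for generation, which the embedding model generally is not.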

Prosammer | 1 year ago

I see what you're saying now, thanks for clarifying.