top | item 41927789


bcherry | 1 year ago

It's kind of interesting because I think most people implementing RAG aren't even thinking about tokenization at all. They're thinking about embeddings:

1. chunk the corpus of data (various strategies but they're all somewhat intuitive)

2. compute embedding for each chunk

3. generate search query/queries

4. compute embedding for each query

5. rank corpus chunks by distance to query (vector search)

6. construct return values (e.g. chunk + surrounding context, or whole doc, etc.)
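The pipeline above can be sketched end to end. This is a toy illustration, not a production implementation: the bag-of-words `embed` function and cosine `rank` stand in for a real embedding model and vector index, and all names here are my own.

```python
import math
import re
from collections import Counter

def embed(text):
    # Toy embedding (stands in for steps 2 and 4): a bag-of-words
    # count vector. A real system would call an embedding model here.
    return Counter(re.findall(r"\w+", text.lower()))

def cosine(a, b):
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def rank(chunks, query, k=2):
    # Step 5: rank corpus chunks by similarity to the query.
    q = embed(query)
    scored = sorted(chunks, key=lambda c: cosine(embed(c), q), reverse=True)
    return scored[:k]

# Step 1: chunk the corpus (here, each sentence is already a chunk).
chunks = [
    "Tokenization splits text into subword units.",
    "Embeddings map chunks to dense vectors.",
    "Vector search ranks chunks by distance to the query.",
]

# Steps 3-6: embed the query, rank, and return the top chunk.
print(rank(chunks, "how does vector search rank chunks?", k=1))
```

Swapping `embed` for a real model and `rank` for an approximate nearest-neighbor index gives the usual production shape; step 6 would then wrap each returned chunk with its surrounding context.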

So this article really gets at the importance of a hidden, mundane-feeling operation that can have an outsized impact on the performance of the system. I do wish the last section had more concrete recommendations, and a code sample of a robust project with normalization, fine-tuning, and evals.
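As one concrete example of the kind of normalization step being asked for: a minimal text-cleaning pass before chunking and embedding, assuming Unicode NFKC folding plus whitespace collapsing is appropriate for the corpus. Visually identical strings (ligatures, full-width forms, non-breaking spaces) can otherwise tokenize differently and land in different places in embedding space.

```python
import re
import unicodedata

def normalize(text):
    # NFKC folds compatibility characters (e.g. the "fi" ligature,
    # full-width forms, non-breaking spaces) so visually identical
    # strings produce identical token sequences.
    text = unicodedata.normalize("NFKC", text)
    # Collapse whitespace runs; stray tabs/newlines shift token boundaries.
    return re.sub(r"\s+", " ", text).strip()

print(normalize("\ufb01le\u00a0name"))  # → "file name"
```

Running the same function over both the corpus chunks and the incoming queries keeps the two sides of the vector search consistent, which is the part that's easy to get subtly wrong.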
