I also found that pure RAG with vector search didn't work. I was building a bot that could answer questions by digging through Slack discussions.
At first, I downloaded entire channels, loaded them into a vector DB, and did RAG. The results sucked. Vector search doesn't understand this kind of material very well, and in this world, specific keywords and error messages are what's actually searchable.
Instead, I take the user's query, ask an LLM (Claude via Bedrock) to extract keywords, search Slack through the API, use an LLM to filter the results down to the relevant discussions, and then summarize them all into a response.
This is slow, of course, so it's heavily multi-threaded. A typical response arrives within 30 seconds.
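A minimal sketch of what that pipeline could look like, with the LLM and Slack API calls injected as plain callables — every helper name here (`extract_keywords`, `search_slack`, `is_relevant`, `summarize`) is a hypothetical stand-in, not the commenter's actual code; the thread pools cover the "very multi-threaded" part:

```python
from concurrent.futures import ThreadPoolExecutor

def answer_from_slack(query, extract_keywords, search_slack, is_relevant, summarize):
    # 1. Ask an LLM to turn the free-form question into searchable keywords.
    keywords = extract_keywords(query)

    # 2. Hit the Slack search API once per keyword, in parallel (the slow part).
    with ThreadPoolExecutor(max_workers=8) as pool:
        result_lists = list(pool.map(search_slack, keywords))
    messages = [m for results in result_lists for m in results]

    # 3. LLM relevance filter, also parallelized per message.
    with ThreadPoolExecutor(max_workers=8) as pool:
        keep_flags = list(pool.map(lambda m: is_relevant(query, m), messages))
    relevant = [m for m, keep in zip(messages, keep_flags) if keep]

    # 4. Summarize the surviving discussions into a single response.
    return summarize(query, relevant)
```

Since `pool.map` preserves input order, the summarizer sees results grouped by keyword.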
For decades we had search engines based on the query terms (keywords). Then there were lots of discussions and some implementations to put a semantic search on top of it to improve the keyword search. A hybrid search. Google Search did exactly that already in 2015 [1].
Now we start from pure semantic search and put keyword search on top of it to improve the semantic search and call it hybrid search.
In both approaches, the overall search performance is exactly identical - to the last digit.
I am glad that, so far, no one has called this an innovation. But you could certainly write a lot of blog articles about it.
When you're creating your embeddings, you can use an LLM to extract keywords from the content and store them in the metadata of each chunk, which improves the relevance of the results returned from retrieval.
Zero-shot keyphrase extraction is a reasonably well-studied field. I don't know what the current SOTA is, but the one that was pretty hot shit last time I needed one was kbir-inspec, which is on HuggingFace, and you can test it right on the page.
Might be worth a shot if performance is a tricky spot in your setup.
Thanks for sharing, I like the approach and it makes a lot of sense for the problem space. Especially using existing products vs building/hosting your own.
I was however tripped up by this sentence close to the beginning:
> we encountered a significant challenge with RAG: relying solely on vector search (even using both dense and sparse vectors) doesn’t always deliver satisfactory results for certain queries.
Not to be overly pedantic, but that's a problem with vector similarity, not RAG as a concept.
Although the author is clearly aware of that, I have had numerous conversations in the past few months alone with people essentially saying "RAG doesn't work because I use pg_vector (or whatever) and it never finds what I'm looking for", not realizing that 1) it's not the only way to do RAG, and 2) there is often a fair difference between the stored embeddings and the vectorized query, and once you understand why, you can figure out how to fix it.
Author here: you're for sure right -- it's not a problem with RAG the theoretical concept. In fact, I think RAG implementations should likely be specific to their use cases (e.g. our hybrid search approach works well for customer support, but I'm not sure if it would work as well in other contexts, say for legal bots).
I've seen the whole gamut of RAG implementations as well, and the implementation (specifically the prompting and the document search) has a lot to do with the end quality.
> Not to be overly pedantic, but that's a problem with vector similarity, not RAG as a concept.
Vector similarity has a surprising failure mode. It only indexes explicit information, missing out the implicit one.
For example "The second word of this phrase, decremented by one" is "first", do you think these strings will embed the same? Calculated results don't retrieve well. Also, deductions in general.
How about "I agree with what John said, but I'd rather apply Victor's solution"? It won't embed like the answer you seek. Multi-hop information seeking questions don't retrieve well.
The obvious fix is to pre-ingest all the RAG text into an LLM and compute these deductions before embedding.
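One way to sketch that suggestion: at ingestion time, run each chunk through an LLM and embed the derived statements alongside the originals. `derive_implications` is a hypothetical stand-in for the LLM call, and the back-pointer format is illustrative:

```python
def expand_chunks(chunks, derive_implications):
    """Return the original chunks plus LLM-derived implicit statements,
    so deductions exist as explicit, embeddable text at index time."""
    expanded = []
    for chunk in chunks:
        expanded.append(chunk)  # keep the original surface text
        for implication in derive_implications(chunk):
            # Store each deduction with a pointer back to its source chunk.
            expanded.append(f"{implication} (derived from: {chunk[:40]})")
    return expanded
```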
Having worked building out a RAG SaaS platform for the past year and having worked on the vendor side of several keyword-based search systems in the past 10 years, I can say it's absolutely necessary to have some kind of hybrid search for most use cases I've seen.
The problem is that most people don't have experience optimizing even one of the retrieval systems (vector or keyword), so a lot of users who try to DIY it end up having an awful time getting to prod. People talk about things like RRF (which is needed), but then miss other big-picture things: the mistakes everyone makes when building out keyword search (not getting the right language rules in place) and not getting the vector side right either (choosing the right embedding models, chunking strategies, etc.).
I recognize I have a bit of a conflict of interest since I'm at a RAG vendor, but I'll abstain from the name/self-promotion and just say: I've seen so many cases where people get this wrong that if you're considering RAG, you really should be hiring a consultant or looking at a complete platform from people who have done it before. Or be prepared to spend a lot of cycles learning and iterating.
People dramatically underestimate the complexity of even reasonably relevant search systems.
One reason is that, unlike other data products, search is an active, conscious action by users. If ads or recommendations are wrong, nobody gets mad. But screw up search and it's like a shop salesperson taking you to the wrong aisle: it's actively frustrating.
So basically every useful search system is disliked to some degree because it will get some things wrong some of the time.
We also included supporting data in that write up showing you can improve significantly on top of Hybrid/RRF using a reranking stage (assuming you have a good reranker model), so we shipped one as an optional step as part of our search engine.
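A reranking stage of the kind described here can be sketched as a thin layer over the fused candidate list. `score_pair` is a hypothetical stand-in for a real reranker model (e.g. a cross-encoder scoring query/document pairs):

```python
def rerank(query, candidates, score_pair, top_n=10):
    """Re-score a fused candidate list with a stronger (slower) model
    and keep only the best top_n documents."""
    reranked = sorted(candidates, key=lambda doc: score_pair(query, doc), reverse=True)
    return reranked[:top_n]
```

The usual design trade-off: the reranker is far more expensive per pair than first-stage retrieval, so it only sees the small fused candidate set, not the whole corpus.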
RRF is a simple and effective means of fusing rankings from multiple recall paths.
Within our open-source RAG product RAGFlow (https://github.com/infiniflow/ragflow), we currently use Elasticsearch instead of a general-purpose vector database because it already provides hybrid search. In the default case, an embedding-based reranker is not required; RRF is enough. And even when a reranker is used, keyword-based retrieval is still a must to hybridize with embedding-based retrieval, which is exactly what RAGFlow's latest 0.7 release provides.
Starting from the next version (a few weeks away), Infinity will also provide more comprehensive hybrid search capabilities: the 3-way recall you mentioned (dense vector, sparse vector, keyword search) will be available within a single request.
pg_search (a full-text search Postgres extension) can be used with pgvector for hybrid search over Postgres tables. It comes with a helpful hybrid search function that uses relative score fusion. Whereas rank fusion considers just the order of the results, relative score fusion uses the actual scores output by the text/vector searches.
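A rough sketch of the distinction: relative score fusion min-max normalizes each system's raw scores onto [0, 1] and takes a weighted sum, instead of looking only at ranks. This is an illustrative implementation of the general idea, not pg_search's actual code:

```python
def relative_score_fusion(text_scores, vector_scores, w_text=0.5, w_vector=0.5):
    """Fuse two result sets using their raw scores (min-max normalized
    per system), instead of ranks alone as RRF does."""
    def normalize(scores):
        if not scores:
            return {}
        lo, hi = min(scores.values()), max(scores.values())
        span = (hi - lo) or 1.0  # avoid dividing by zero on constant scores
        return {doc: (s - lo) / span for doc, s in scores.items()}

    norm_text, norm_vector = normalize(text_scores), normalize(vector_scores)
    fused = {
        doc: w_text * norm_text.get(doc, 0.0) + w_vector * norm_vector.get(doc, 0.0)
        for doc in set(norm_text) | set(norm_vector)
    }
    return sorted(fused, key=fused.get, reverse=True)
```

Unlike rank fusion, a document that wins its list by a wide score margin is rewarded more than one that wins narrowly.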
I've implemented a very similar hybrid RAG solution, and it has improved LLM responses enormously. There are other things you can do that yield huge improvements too, like destructuring your data and placing it into a graph structure with queryable edge relationships. I think we're just scratching the surface.
This is really interesting. Do you have other recommendations for improvements (ideally with sources, if you have any)? I have to build a RAG solution for my job, and right now I'm collecting information to determine the best way forward.
Reciprocal rank scoring is just one way of forcing scores into a fixed distribution: in this case, decaying with the reciprocal of the rank. But it also assumes fixed weights across all components, i.e. that the top-ranked keyword match has the same relevance as the top-ranked semantic match.
There are a couple ways around this. Either learning the relative importance based on the query, and/or using a separate reranking function (usually a DNN) that also takes user behavior into account.
So I'm not sure why the article uses 1/Rank alone. Did you test both and find that the smoothing didn't help? My understanding is that it has been pretty important for the best results.
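For reference, the standard RRF formulation (Cormack et al.) scores each document as the sum of 1/(k + rank) over the input rankings, with k = 60 in the original paper; the plain 1/Rank discussed here is the k = 0 special case. A minimal sketch:

```python
def rrf(rankings, k=60):
    """Reciprocal Rank Fusion: score(d) = sum_i 1 / (k + rank_i(d)).
    k = 60 is the constant from the original paper; k = 0 gives plain 1/rank."""
    scores = {}
    for ranked_list in rankings:
        for rank, doc in enumerate(ranked_list, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

The smoothing constant damps the gap between rank 1 and rank 2 (1/61 vs 1/62 rather than 1 vs 1/2), so a single list's top hit can't dominate the fusion as easily.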
The composability of RRF is definitely one of its most appealing characteristics. It doesn't matter what algorithm or vendor you have, you can just fuse with ranks alone. I've seen it shine when fusing lexical and vector search results where semantic attributes like styles and exact attributes like quantities are mixed together in queries, e.g., "modern formal watch with 40mm face".
While it's not such a problem in RAG, one downside is that it complicates pagination for results (there are a few different ways to tackle this).
Pardon my ignorance but I was hung up on this line.
> Out-of-sync document stores could lead to subtle bugs, such as a document being present in one store but not another.
But then the article suggests uploading synchronously to S3/DDB and then syncing asynchronously to the actual document stores. How does this solve the out-of-sync issue?
It doesn't; my thinking is that it can't be fully solved.
As soon as the indexed documents contain lingo of any kind, you need hybrid search, IMHO.
Additionally, it's even better for UX if you can add conditional fuzzy matching into the mix, so that fat-fingering something still yields a workable result (something along the lines of: "the results from the tf-idf search are garbage, so let's redo the search with fuzzy matching this time").
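That conditional fallback can be sketched with nothing but the standard library: run the exact search first, and only fuzz the query terms when it comes back thin. The `exact_search` callable and the dict-shaped index are assumptions for illustration:

```python
import difflib

def search_with_fuzzy_fallback(query, index, exact_search, min_results=3):
    """Run the exact keyword search first; if it comes back thin, snap each
    query term to the closest word in the index vocabulary and retry."""
    hits = exact_search(query)
    if len(hits) >= min_results:
        return hits
    vocabulary = sorted(index)  # the indexed terms we can correct against
    corrected = []
    for term in query.split():
        close = difflib.get_close_matches(term, vocabulary, n=1, cutoff=0.7)
        corrected.append(close[0] if close else term)
    return exact_search(" ".join(corrected))
```

Gating the fuzzy pass on result count keeps the fast exact path for well-typed queries and only pays the correction cost when retrieval looks like it failed.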
I have no doubt this probably produces better results than a simple vector search, but you cannot escape the fact that you are converting a query to a set of results, and so the quality and intent of the query matter. In fact, it matters more than the search mechanics. Anyone who has ever used a search engine or some other search mechanism knows that intuitively.
Hybrid might work for English, but where are you going to get sparse embeddings like SPLADE or ELSERv2 for most other languages? Vector search with ada-002 or text-embedding-3-large capped to the first 500-1000 dimensions will give you support for 100+ languages for free. If you are using BM25, then you need to train BM25 on every single separate knowledge base, which is annoying and expensive.
Great article. Hybrid search works well for a lot of scenarios.
The tradeoffs of using existing systems vs building your own resonate with me. What we eventually experienced, however, is that periods of bad search performance often correlated with out-of-date search indices.
I'd be interested in another article detailing how you monitor search. It can be tricky to keep an entire search system moving.
1. Does anyone know of a Postgres reranking extension that goes beyond RRF, via ML models or at least custom code?
2. If anyone is observing significant gains from incorporating knowledge graphs into the retrieval step, what kind of a knowledge graph are you working with, what is your retrieval algorithm, and what technology are you using to store it?
pamelafox|1 year ago
https://github.com/Azure-Samples/rag-postgres-openai-python/
Here's the RRF+Hybrid part: https://github.com/Azure-Samples/rag-postgres-openai-python/...
That's largely based off a sample from the pgvector repo, with a few tweaks.
Agreed that Hybrid is the way to go, it's what the Azure AI Search team also recommends, based off their research:
https://techcommunity.microsoft.com/t5/ai-azure-ai-services-...
[1] https://searchengineland.com/semantic-search-entity-based-se...
siquick|1 year ago
LlamaIndex does this out of the box.
edude03|1 year ago
https://medium.com/@cdg2718/why-your-rag-doesnt-work-9755726... basically says everything I often say to people with RAG/vector search problems but again, seems like the assembled team has it handled :)
matthew_mg|1 year ago
Don't think it's overly self-promotional if first asked :)
If you still don't wanna say, feel free to email, email in profile
cheesyFish|1 year ago
LlamaIndex has a module for exactly this
https://docs.llamaindex.ai/en/stable/examples/retrievers/rel...
yingfeng|1 year ago
On the other hand, let me introduce another database we developed, Infinity (https://github.com/infiniflow/infinity), which provides hybrid search. You can see the performance here (https://github.com/infiniflow/infinity/blob/main/docs/refere...): both vector search and full-text search perform much faster than other open-source alternatives.
gregnr|1 year ago
(disclaimer: supabase dev who went down the rabbit hole with hybrid search)
johnjwang|1 year ago
We used 1/Rank in the article for simplicity purposes, though I can see why this might be confusing to an astute reader.
cricketlover|1 year ago
> Data, numbers
How much data are we talking about?
pamelafox|1 year ago
I'm not using that in my own experiments, since I don't want to worry about the performance of running a model in production, but it seems worth a try.