When I prototype RAG systems I don’t use a “vector database.” I just use a pandas dataframe and I do an apply() with a cosine distance function that is one line of code. I’ve done it with up to 1k rows and it still takes less than a second.
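A minimal sketch of that approach, with random vectors standing in for real model embeddings so the block is self-contained (column names and sizes are illustrative):

```python
import numpy as np
import pandas as pd

# Toy corpus: in a real prototype each row's embedding would come from a model;
# here they are random 384-dim vectors so the example runs on its own.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "text": [f"doc {i}" for i in range(1_000)],
    "embedding": [rng.standard_normal(384) for _ in range(1_000)],
})

def cosine_sim(a, b):
    # The one-line similarity function the comment alludes to.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

query = rng.standard_normal(384)
df["score"] = df["embedding"].apply(lambda e: cosine_sim(e, query))
top5 = df.nlargest(5, "score")  # the rows to stuff into the prompt
```

At 1k rows the `apply()` finishes in well under a second, which is the whole point.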
This is exactly what I do. No one talks about how many GPUs you'd need to generate enough embeddings that brute-force search stops being enough.
Here's some back of the envelope math. Let's say you are using a 1B parameter LLM to generate the embedding. That's 2B FLOPs per token. Let's assume a modest chunk size, 2K tokens. That's 4 trillion FLOPs for one embedding.
What about the dot product in the cosine similarity? Let's assume an embedding dim of 384. That's 2 * 384 = 768.
So 4 trillion ops for the embedding vs 768 for the cosine similarity. That's a factor of roughly five billion; call it a billion to be conservative.
So you could have a billion embeddings - brute forced - before the lookup became more expensive than generating the embedding.
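The arithmetic above can be checked in a few lines; the exact ratio comes out around 5e9, the same order of magnitude as the billion quoted:

```python
params = 1e9                      # 1B-parameter embedding model
flops_per_token = 2 * params      # ~2 FLOPs per parameter per token (forward pass)
chunk_tokens = 2_000
embed_flops = flops_per_token * chunk_tokens  # cost to embed one 2K-token chunk

dim = 384
dot_flops = 2 * dim               # one multiply + one add per dimension

ratio = embed_flops / dot_flops
print(f"embed: {embed_flops:.0e} FLOPs, dot: {dot_flops} FLOPs, ratio: {ratio:.1e}")
```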
What does that mean at the application level? It means that the time needed to generate millions of embeddings is measured in GPU weeks.
The time needed to lookup an embedding using an approximate nearest neighbors algorithm from millions of embeddings is measured in milliseconds.
The game changed when we switched from word2vec to LLMs to generate embeddings.
1 billion times is such a big difference that it breaks the assumptions earlier systems were designed under.
The embedding is generated once. Search is done whenever a user inputs a query. The cosine similarity is also not done against a single embedding; it's done against millions or billions of embeddings if you are not using an index. So the actual conclusion is that once you have a billion embeddings, a single search operation costs as much as generating an embedding.
But then, you are not even taking into account the massive cost of keeping all of these embeddings in memory ready to be searched.
Everyone is piling on you, but I'd love to see what their companies are doing. Cosine similarity and loading a few thousand rows sounds trivial, but most enterprise/B2B chat/copilot apps have a relatively small amount of data whose embeddings can fit in RAM. Combine that with natural sharding by customer ID and it turns out vector DBs are much more niche than an RDBMS. I suspect most people reaching for them haven't done the calculus :/
1k rows isn't really at a point where you need any form of database. Vector or BOW, you can just brute-force the search with such a minuscule amount of data (arguably this holds into the low millions).
The problem is what happens when you have an additional 6 orders of magnitude of data, and the data itself is significantly larger than the system RAM, which is a very realistic case in a search engine.
1k is not much. My first RAG had over 40K docs (all short, but still...)
The one I'm working on right now has 115K docs (some quite big - I'll likely have to prune the largest 10% just to fit in my RAM).
These are all "small" - for personal use on my local machine. I'm currently RAM limited, otherwise I can think of (personal) use cases that are an order of magnitude larger.
Of course, for all I know, your method may still be as fast on those as on a vector DB.
I must be missing something -- why is the size of the documents a factor? If you embed a document it becomes a vector of ~1k floats, and 115k * 1k float32s is under half a GB, trivial to fit in modern-day RAM.
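The point is easy to verify, assuming float32 values and a ~1,024-dim embedding (both assumptions, since the model wasn't specified):

```python
docs = 115_000
dim = 1_024            # assume ~1k-dim embeddings
bytes_per_float = 4    # float32

total_mb = docs * dim * bytes_per_float / 1e6
print(f"{total_mb:.0f} MB")  # prints "471 MB": fits easily in modern RAM
```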
Even on the production side there is something to be said for just doing things in memory, even over larger datasets. Certainly, like all things, there is a possible scale issue, but I would much rather spin up a dedicated machine with a lot of memory than pay some of the wildly high fees for a vector DB.
Not sure if others have gone down this path but I have been testing out ways to store vectors to disk in files for later retrieval and then doing everything in memory. For me the tradeoff of a slightly slower response time was worth it compared to the 4-5 figure bill I would otherwise be getting from a vector DB.
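One way to do the vectors-on-disk approach with plain `.npy` files (the path and sizes here are illustrative, not the commenter's actual setup):

```python
import os
import tempfile
import numpy as np

# Persist embeddings to a plain numpy file on disk.
path = os.path.join(tempfile.gettempdir(), "embeddings.npy")
embeddings = np.random.default_rng(1).standard_normal((10_000, 384)).astype(np.float32)
embeddings /= np.linalg.norm(embeddings, axis=1, keepdims=True)  # unit-normalize rows
np.save(path, embeddings)

# Later: load (or memory-map) the file and search entirely in memory.
db = np.load(path, mmap_mode="r")
query = embeddings[0]                 # pretend this is a fresh query embedding
scores = db @ query                   # cosine similarity, since rows are unit-norm
best = int(np.argmax(scores))
```

`mmap_mode="r"` lets the OS page the file in lazily, so startup cost is near zero even for large files.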
There is certainly some scale at which a more sophisticated approach is needed. But your method (maybe with something faster than python/pandas) should be the go-to for demonstration and kept until it's determined that the brute force search is the bottleneck.
This issue is prevalent throughout infrastructure projects. Someone decides they need a RAG system and then the team says "let's find a vector db provider!" before they've proven value or understood how much data they have or anything. So they waste a bunch of time and money before they even know if the project is likely to work.
It's just like the old model of setting up a Hadoop cluster as a first step to do "big data analytics" on what turns out to be 5GB of data that you could fit in a dataframe or process with awk https://adamdrake.com/command-line-tools-can-be-235x-faster-... (edit: actually currently on the HN front page)
It's a perfect storm of sales-led tooling where leadership is sold something they don't understand, over-engineering, and trying to apply waterfall project management to "AI" projects that have lots of uncertainty and need a de-risking-based approach where you show that it's likely to work and iterate, instead of building a big foundation first.
Even up to 1M or so rows you can just store everything in a numpy array or PyTorch tensor and compute similarity directly between your query embedding and the entire database. Will be much faster than the apply() and still feasible to run on a laptop.
You may benefit from polars; it can use multiple cores better than pandas, and has some of the niceties from Arrow (which was written/championed by the power duo of Wes and Hadley, authors of pandas and the R tidyverse respectively).
I agree pandas, or whatever dataframe library you like, is ideal for prototyping and exploring, rather than setting up a bunch of infrastructure in a dev environment. Especially if you have labels and are evaluating against a ground truth.
Thanks for the article, and I definitely agree you are better off starting simple, like a parquet file and FAISS, and then testing options with your data. I say that mainly so you can test chunking strategies, because of how big an effect chunking has on everything downstream, whatever vector DB or BERT path you take -- chunking is a much bigger source of impact than most people acknowledge.
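A minimal chunker along those lines, fixed-size with overlap; `size` and `overlap` are exactly the knobs worth sweeping against an evaluation set (the values here are arbitrary):

```python
def chunk(text: str, size: int = 800, overlap: int = 100) -> list[str]:
    """Fixed-size character chunks with overlap between neighbors.

    size and overlap are the parameters to sweep when evaluating
    retrieval quality downstream.
    """
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

pieces = chunk("x" * 2_000, size=800, overlap=100)
```

Real systems often chunk on token or sentence boundaries instead of characters, but the overlap idea carries over unchanged.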
I'm expecting to deploy a 6-figure "row count" RAG in the near future... with CTranslate2, matmul-based, at most lightly (like, single digits?) batched, and probably defaulting to CPU because the encoder-decoder part of the RAG process is just way more expensive and the database memory hog along with relatively poor TopK performance isn't worth the GPU.
That's kinda why I use LanceDB. It works on all three OSes, doesn't require large installs, and is quite easy to use. The files are also just Lance (a simple Arrow-based columnar format), so no need to deal with SQL.
Also, you are probably doing it wrong by turning what should be a single matrix multiplication into a for loop over rows. The vectorized solution performs much better:
sim = np.vstack(df.col) @ vec
These days anything less than 2TB should be done 100% in memory.
You might be interested in SearchArray which emulates the classic search index side of things in a pandas dataframe column
https://github.com/softwaredoug/searcharray
You'll realize that it scales well beyond 1k.