When I prototype RAG systems I don’t use a “vector database.” I just use a pandas dataframe and I do an apply() with a cosine distance function that is one line of code. I’ve done it with up to 1k rows and it still takes less than a second.
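A minimal sketch of that approach, with random vectors standing in for real model embeddings so the block is self-contained (column names and sizes are illustrative):

```python
import numpy as np
import pandas as pd

# Toy corpus: in a real prototype each row's embedding would come from a model;
# here they are random 384-dim vectors so the example runs on its own.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "text": [f"doc {i}" for i in range(1_000)],
    "embedding": [rng.standard_normal(384) for _ in range(1_000)],
})

def cosine_sim(a, b):
    # The one-line similarity function the comment alludes to.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

query = rng.standard_normal(384)
df["score"] = df["embedding"].apply(lambda e: cosine_sim(e, query))
top5 = df.nlargest(5, "score")  # the rows to stuff into the prompt
```

At 1k rows the `apply()` finishes in well under a second, which is the whole point.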
This is exactly what I do. No one talks about how many GPUs you'd need to generate enough embeddings that brute-force search stops being enough.
Here's some back of the envelope math. Let's say you are using a 1B parameter LLM to generate the embedding. That's 2B FLOPs per token. Let's assume a modest chunk size, 2K tokens. That's 4 trillion FLOPs for one embedding.
What about the dot product in the cosine similarity? Let's assume an embedding dim of 384. That's 2 * 384 = 768.
So 4 trillion ops for the embedding vs 768 for the cosine similarity. That's a factor of roughly five billion; call it a billion to be conservative.
So you could have a billion embeddings - brute forced - before the lookup became more expensive than generating the embedding.
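The arithmetic above can be checked in a few lines; the exact ratio comes out around 5e9, the same order of magnitude as the billion quoted:

```python
params = 1e9                      # 1B-parameter embedding model
flops_per_token = 2 * params      # ~2 FLOPs per parameter per token (forward pass)
chunk_tokens = 2_000
embed_flops = flops_per_token * chunk_tokens  # cost to embed one 2K-token chunk

dim = 384
dot_flops = 2 * dim               # one multiply + one add per dimension

ratio = embed_flops / dot_flops
print(f"embed: {embed_flops:.0e} FLOPs, dot: {dot_flops} FLOPs, ratio: {ratio:.1e}")
```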
What does that mean at the application level? It means that the time needed to generate millions of embeddings is measured in GPU weeks.
The time needed to lookup an embedding using an approximate nearest neighbors algorithm from millions of embeddings is measured in milliseconds.
The game changed when we switched from word2vec to LLMs to generate embeddings.
1 billion times is such a big difference that it breaks the assumptions earlier systems were designed under.
The embedding is generated once. Search is done whenever a user inputs a query. The cosine similarity is also not done against a single embedding; it's done against millions or billions of embeddings if you are not using an index. So the actual conclusion is that once you have a billion embeddings, a single search operation costs as much as generating an embedding.
But then, you are not even taking into account the massive cost of keeping all of these embeddings in memory ready to be searched.
Everyone is piling on you, but I'd love to see what their companies are doing. Cosine similarity and loading a few thousand rows sounds trivial, but most enterprise/B2B chat/copilot apps have a relatively small amount of data whose embeddings can fit in RAM. Combine that with natural sharding by customer ID and it turns out vector DBs are much more niche than an RDBMS. I suspect most people reaching for them haven't done the calculus :/
1k rows isn't really at a point where you need any form of database. Vector or BOW, you can just brute-force the search with such a minuscule amount of data (arguably this holds into the low millions).
The problem is what happens when you have an additional 6 orders of magnitude of data, and the data itself is significantly larger than the system RAM, which is a very realistic case in a search engine.
1k is not much. My first RAG had over 40K docs (all short, but still...)
The one I'm working on right now has 115K docs (some quite big - I'll likely have to prune the largest 10% just to fit in my RAM).
These are all "small" - for personal use on my local machine. I'm currently RAM limited, otherwise I can think of (personal) use cases that are an order of magnitude larger.
Of course, for all I know, your method may still be as fast on those as on a vector DB.
I must be missing something -- why is the size of the documents a factor? If you embed a document it becomes a vector of ~1k floats, and 115k * 1k float32s is under half a GB, trivial to fit in modern-day RAM.
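The point is easy to verify, assuming float32 values and a ~1,024-dim embedding (both assumptions, since the model wasn't specified):

```python
docs = 115_000
dim = 1_024            # assume ~1k-dim embeddings
bytes_per_float = 4    # float32

total_mb = docs * dim * bytes_per_float / 1e6
print(f"{total_mb:.0f} MB")  # prints "471 MB": fits easily in modern RAM
```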
Even on the production side there is something to be said for just doing things in memory, even over larger datasets. Certainly, like all things, there is a possible scale issue, but I would much rather spin up a dedicated machine with a lot of memory than pay some of the wildly high fees for a vector DB.
Not sure if others have gone down this path but I have been testing out ways to store vectors to disk in files for later retrieval and then doing everything in memory. For me the tradeoff of a slightly slower response time was worth it compared to the 4-5 figure bill I would otherwise be getting from a vector DB.
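One way to do the vectors-on-disk approach with plain `.npy` files (the path and sizes here are illustrative, not the commenter's actual setup):

```python
import os
import tempfile
import numpy as np

# Persist embeddings to a plain numpy file on disk.
path = os.path.join(tempfile.gettempdir(), "embeddings.npy")
embeddings = np.random.default_rng(1).standard_normal((10_000, 384)).astype(np.float32)
embeddings /= np.linalg.norm(embeddings, axis=1, keepdims=True)  # unit-normalize rows
np.save(path, embeddings)

# Later: load (or memory-map) the file and search entirely in memory.
db = np.load(path, mmap_mode="r")
query = embeddings[0]                 # pretend this is a fresh query embedding
scores = db @ query                   # cosine similarity, since rows are unit-norm
best = int(np.argmax(scores))
```

`mmap_mode="r"` lets the OS page the file in lazily, so startup cost is near zero even for large files.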
There is certainly some scale at which a more sophisticated approach is needed. But your method (maybe with something faster than python/pandas) should be the go-to for demonstration and kept until it's determined that the brute force search is the bottleneck.
This issue is prevalent throughout infrastructure projects. Someone decides they need a RAG system and then the team says "let's find a vector db provider!" before they've proven value or understood how much data they have or anything. So they waste a bunch of time and money before they even know if the project is likely to work.
It's just like the old model of setting up a Hadoop cluster as a first step to do "big data analytics" on what turns out to be 5GB of data that you could fit in a dataframe or process with awk https://adamdrake.com/command-line-tools-can-be-235x-faster-... (edit: actually currently on the HN front page)
It's a perfect storm of sales-led tooling where leadership is sold something they don't understand, over-engineering, and trying to apply waterfall project management to "AI" projects that have lots of uncertainty and need a de-risking-based approach where you show that it's likely to work and iterate, instead of building a big foundation first.
Even up to 1M or so rows you can just store everything in a numpy array or PyTorch tensor and compute similarity directly between your query embedding and the entire database. Will be much faster than the apply() and still feasible to run on a laptop.
You may benefit from polars; it can use multiple cores better than pandas, and has some of the niceties from Arrow (which was written/championed by the power duo of Wes and Hadley, authors of pandas and the R tidyverse respectively).
I agree pandas, or whatever dataframe library you like, is ideal for prototyping and exploring, rather than setting up a bunch of infrastructure in a dev environment. Especially if you have labels and are evaluating against a ground truth.
Thanks for the article, and I definitely agree you are better off starting simple, like a parquet file and FAISS, and then testing options with your data. I say that mainly so you can test chunking strategies, because of how big an effect chunking has on everything downstream, whatever vector DB or BERT path you take -- chunking is a much bigger source of impact than most people acknowledge.
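A minimal chunker along those lines, fixed-size with overlap; `size` and `overlap` are exactly the knobs worth sweeping against an evaluation set (the values here are arbitrary):

```python
def chunk(text: str, size: int = 800, overlap: int = 100) -> list[str]:
    """Fixed-size character chunks with overlap between neighbors.

    size and overlap are the parameters to sweep when evaluating
    retrieval quality downstream.
    """
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

pieces = chunk("x" * 2_000, size=800, overlap=100)
```

Real systems often chunk on token or sentence boundaries instead of characters, but the overlap idea carries over unchanged.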
I'm expecting to deploy a 6-figure "row count" RAG in the near future... with CTranslate2, matmul-based, at most lightly (like, single digits?) batched, and probably defaulting to CPU because the encoder-decoder part of the RAG process is just way more expensive and the database memory hog along with relatively poor TopK performance isn't worth the GPU.
That's kinda why I use LanceDB. It works on all three OSes, doesn't require large installs, and is quite easy to use. The files are also just Lance (a simple Arrow-based columnar format), so no need to deal with SQL.
Also, you are probably doing it wrong by turning what should be a single matrix multiplication into a for loop over rows. The vectorized solution performs much better:
sim = np.vstack(df.col) @ vec
These days anything less than 2TB should be done 100% in memory.
You might be interested in SearchArray which emulates the classic search index side of things in a pandas dataframe column
https://github.com/softwaredoug/searcharray
You'll realize that it scales well beyond 1k.