As others have correctly pointed out, building a vector search or recommendation application requires a lot more than similarity alone. We have seen HNSW become commoditised, and the real value lies elsewhere. Just because a database has vector functionality doesn't mean it will actually service anything beyond "hello world" semantic search applications. IMHO these have questionable value, much like the simple Q&A RAG applications that have proliferated. The elephant in the room with these systems is that if you are relying on machine learning models to produce the vectors, you are going to need to invest heavily in the ML components of the system. Domain-specific models are a must if you want to be a serious contender to an existing search system, and all the usual considerations still apply regarding frequent retraining and monitoring of the models. Currently this is left as an exercise for the reader - and a very large one at that. We (https://github.com/marqo-ai/marqo, I am a co-founder) are investing heavily in making the ML production-worthy and in continuous learning from feedback as part of the system. There is a lot more to think about: how you represent documents with multiple vectors, multimodality, late interaction, the interplay between embedding quality and HNSW graph quality (i.e. recall), and much more.
In general I find they're incredibly good for rapidly building out search engines for things that would normally be difficult to do with plain text.
The most obvious example is code search where you can describe the function's behavior and get a match. But you could also make a searchable list of recipes that would allow a user to search something like "a hearty beef dish for a cold fall night". Or searching support tickets where full text might not match, "all the cases where users had trouble signing on".
Interestingly Q & A is ultimately a (imho fairly boring) implementation of this pattern.
The really nice part is that you can implement working demos of these projects in just a few lines of code once you have the vector DB set up. Once you start thinking in terms of semantic search rather than text matching, you realize you can build old-Google-style search engines for basically any text available to you.
One thing that is a bit odd about the space, from what I've experienced and heard, is that setup and performance on most of these products is not all that great. Given that you can implement the demo version of a vector DB in a few lines of numpy, you would hope that investing in a full vector DB product would get you an easily scalable solution.
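For concreteness, the "few lines of numpy" demo version might look something like this: exact brute-force search, no ANN index, fine for small corpora (the names here are illustrative, not from any particular library):

```python
import numpy as np

def build_index(vectors):
    """Normalize rows so a dot product equals cosine similarity."""
    v = np.asarray(vectors, dtype=np.float32)
    return v / np.linalg.norm(v, axis=1, keepdims=True)

def search(index, query, k=5):
    """Brute-force top-k by cosine similarity."""
    q = np.asarray(query, dtype=np.float32)
    q = q / np.linalg.norm(q)
    scores = index @ q
    top = np.argsort(-scores)[:k]
    return list(zip(top.tolist(), scores[top].tolist()))

# Toy 3-dimensional "embeddings"
docs = build_index([[1, 0, 0], [0, 1, 0], [0.9, 0.1, 0]])
print(search(docs, [1, 0, 0], k=2))
```

Everything a real vector DB adds on top of this (persistence, updates, ANN indexing, filtering, replication) is exactly the part people report struggling with.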
Everyone I talk to who is building some vector db based thing sooner or later realizes they also care about the features of a full-text search engine.
They care about filtering, they care to some degree about direct lexical matches, they care about paging, getting groups / facet counts, etc.
Vectors, IMO, are just one feature that a regular search engine should have. Currently Vespa does the best job of this, though lately it seems Lucene (Elasticsearch and OpenSearch) is really working hard to compete.
My company is using vector search with Elasticsearch. It’s working well so far. IMO Elastic will eat most vector-first/only products because of its strength at full-text search, plus all the other stuff it does.
Until very recently, "dense retrieval" was not even as good as BM25, and it still is not always better.
I think a lot of people use dense retrieval in applications where sparse retrieval is still adequate and much more flexible, because it has the hype behind it. Hybrid approaches also exist and can help balance the strengths and weaknesses of each.
Vectors can also work in other tasks, but largely people seem to be using them for retrieval only, rather than applying them to multiple tasks.
Vector search is not exclusively in the domain of text search. There is always image/video search.
But pre-filtering is important, since you want to reduce the set of items to be matched on, and it feels like Elasticsearch/OpenSearch are faring better in this regard. Mixed scoring derived from both sparse and dense calculations is also important, which is another strength of ES/OS.
I'm building a RAG system for my personal use. Say I have a lot of notes on various topics I've compiled over the years. They're scattered over a lot of text files (and org nodes). I want to be able to ask questions in natural language and have the system query my notes and give me an answer.
The approach I'm going for is to store those notes in a vector DB. When I ask my query, a search is performed and, say, the top 5 vectors are sent to GPT for parsing (along with my query). GPT will then come back with an answer.
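That retrieve-then-ask pipeline can be sketched in a few lines; this is a hedged sketch, not the poster's actual code, and the function names and prompt format are made up for illustration (the LLM call itself is left as a placeholder):

```python
import numpy as np

def top_k_notes(note_vectors, notes, query_vector, k=5):
    """Return the k notes closest to the query by cosine similarity."""
    v = np.asarray(note_vectors, dtype=np.float32)
    v = v / np.linalg.norm(v, axis=1, keepdims=True)
    q = np.asarray(query_vector, dtype=np.float32)
    q = q / np.linalg.norm(q)
    order = np.argsort(-(v @ q))[:k]
    return [notes[i] for i in order]

def build_prompt(question, retrieved):
    """Assemble the context + question prompt sent to the LLM."""
    context = "\n---\n".join(retrieved)
    return f"Answer using only these notes:\n{context}\n\nQuestion: {question}"

# The actual GPT call would go here, e.g. via the OpenAI client,
# passing build_prompt(...) as the user message.
```

Each of the knobs mentioned below (chunk size, overlap, k) shows up directly as a parameter in a sketch like this.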
I can build something like this, but I'm struggling in figuring out metrics for how good my system is. There are many variables (e.g. amount of content in a given vector, amount of overlap amongst vectors, number of vectors to send to GPT, and many more). I'd like to tweak them, but I also want some objective way to compare different setups. Right now all I do is ask a question, look at the answer, and try to subjectively gauge whether I think it did a good job.
Any tips on how people measure the performance/effectiveness for these types of problems?
For small personal projects it's kind of hard to build metrics like this, because the volume of indexed content in the database tends to be pretty low. If you're indexing paragraphs, you might consistently be able to fit all relevant paragraphs in the context itself.
What I can recommend is to take the coffee-tasting approach. Don't try to test and evaluate individual responses; instead, lock the seed used in generation and use the same prompt for two different runs. Change one variable and do a relative comparison of the two outputs. Off the top of my head, the variables probably worth testing for you:
* Choice of models and/or tunes
* System prompts
* Temperature of the model against your queries
* Similarity threshold for document inclusion (you only want relevant documents in your RAG context; set it too low and you'll get some extra distractions, too high and useful information might be left out of the context).
If you set up a system to track the comparisons, either automatically or by hand, that just records which side of the change worked better for your use case, and you test that same change across a bunch of different prompts, you should be able to tally up whether the control or the change was preferred.
Keep those data points! They are your bench log, and can be invaluable later on for anything you do with the system: seeing what changed in aggregate, what had the most outsized impact, etc., and guiding you toward useful tooling for testing or toward existing solutions out there.
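The bench-log bookkeeping described above is tiny; a sketch (record shape and field names are assumptions, not from the comment):

```python
from collections import Counter

def tally(bench_log):
    """Count wins per (variable, side).
    bench_log: (prompt, variable_changed, winner) records,
    where winner is "control" or "change"."""
    return Counter((variable, winner) for _, variable, winner in bench_log)

def preference_rate(bench_log, variable):
    """Fraction of A/B trials for one variable where the change won."""
    outcomes = [w for _, v, w in bench_log if v == variable]
    return sum(w == "change" for w in outcomes) / len(outcomes)

log = [
    ("query 1", "temperature", "change"),
    ("query 2", "temperature", "change"),
    ("query 3", "temperature", "control"),
    ("query 1", "similarity threshold", "control"),
]
print(preference_rate(log, "temperature"))
```

The point is that even hand-labeled pairwise preferences become a usable metric once you aggregate them across prompts.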
I use lots and lots of domain-specific test cases at several layers, numbering in the hundreds or thousands. The score is the number of test cases that pass, so it requires a different approach than all-or-nothing tests. The layers depend on your RAG "architecture", but I test the RAG query generation and scoring (comparing ordered lists is the simplest, but I also include a lot of fuzzy comparisons), the LLM scoring the relevance of retrieved snippets before feeding them into the final answering prompt, and the final answer. The most annoying part is the prompt to score the final answer, since it tends to come out looking like a CollegeBoard AP test scoring rubric.
This requires a lot of domain-specific work. For example, two of my test cases are “Is it [il]legal to build an atomic bomb” run against the entire US Code [1], so I have a list of sections that are relevant to the question that I’ve scored before eventually getting an answer of “it is illegal”, followed by several prompts that evaluate nuance in the answer (“it’s illegal except for…”). I have hundreds of these test cases, approaching a thousand. It’s a slog.
[1] 42 U.S.C. 2122 is one of the “right” sections in case anyone is wondering. Another step tests whether 2121 is pulled in based on the mention in 2122
Blog on the same topic - https://blog.langchain.dev/evaluating-rag-pipelines-with-rag...
The main thing is that there's no "objective" way, but if you rank and label your own data then you can certainly get a ranking that's subjectively well performing according to you.
RAG in this case is essentially the same as a recommender system so you can approach it with the same metrics you would there.
You'll need to build a data set with known correct answers, but then it's basically standard IR metrics: NDCG (Normalized Discounted Cumulative Gain) is a good place to start; MRR (Mean Reciprocal Rank) and MAP (Mean Average Precision) are other options. You could also just look at the accuracy of getting your result in the top k results for various thresholds of k (which can be interpreted as the probability of getting your result in the top k).
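Both metrics fit in a few lines; a sketch, assuming binary relevance sets for MRR and graded gains for NDCG:

```python
import math

def mrr(ranked_lists, relevant_sets):
    """Mean Reciprocal Rank: average 1/rank of the first relevant hit per query."""
    total = 0.0
    for results, relevant in zip(ranked_lists, relevant_sets):
        for rank, doc in enumerate(results, start=1):
            if doc in relevant:
                total += 1.0 / rank
                break
    return total / len(ranked_lists)

def ndcg_at_k(results, gains, k):
    """NDCG@k with graded relevance (gains: doc -> relevance grade)."""
    dcg = sum(gains.get(doc, 0) / math.log2(rank + 1)
              for rank, doc in enumerate(results[:k], start=1))
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(rank + 1) for rank, g in enumerate(ideal, start=1))
    return dcg / idcg if idcg else 0.0
```

An NDCG of 1.0 means the retrieved order matches the ideal order; MRR rewards putting the first relevant note as high as possible, which matters most when only the top few chunks are sent to the LLM.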
Included here is a bit of the old tried and true: NDCG/MRR/Precision @k - what you really want for measuring your information retrieval systems.
But we also talk through a bit of the "new", how to use Evals to generate the building blocks for those metrics above. You will want both hand labels and the automated Evals in the end to evaluate your system.
txtai is an all-in-one embeddings database for semantic search, LLM orchestration and language model workflows.
Embeddings databases are a union of vector indexes (sparse and dense), graph networks and relational databases. This enables vector search with SQL, topic modeling and retrieval augmented generation.
txtai adopts a local-first approach. A production-ready instance can be run locally within a single Python instance. It can also scale out when needed.
txtai can use Faiss, Hnswlib or Annoy as its vector index backend. This is relevant in terms of the ANN-Benchmarks scores. (Disclaimer: I am the author of txtai.)
For example, pgvector is listed as not having role-based access control, but the Postgres manual dedicates an entire chapter to it: https://www.postgresql.org/docs/current/user-manag.html
Hence why I’d be interested to know more about the supporting details for the different categories. It may help uncover some inadvertent errors in the analysis, but would also serve as a useful jumping-off point for people doing their own research.
Totally agree about the puzzling rubric. PostgreSQL supports role-based access control (RBAC). Not to mention, with PostgreSQL and the pgvector extension, you have a whole list of languages ready to use it:
C++ pgvector-cpp
C# pgvector-dotnet
Crystal pgvector-crystal
Dart pgvector-dart
Elixir pgvector-elixir
Go pgvector-go
Haskell pgvector-haskell
Java, Scala pgvector-java
Julia pgvector-julia
Lua pgvector-lua
Node.js pgvector-node
Perl pgvector-perl
PHP pgvector-php
Python pgvector-python
R pgvector-r
Ruby pgvector-ruby, Neighbor
Rust pgvector-rust
Swift pgvector-swift
Wonder how many of those other Vector databases play nice.
That stood out to me as well. I've been playing with pgvector, and there's no reason you can't use row/table role-based security.
I think there's an unmentioned benefit to using something like pgvector also. You don't need a separate relational database! In fact you can have foreign keys to your vectors/embeddings which is super powerful to me.
Same for Developer experience. If you used Postgres or any other relational db (which I think covers a large % of devs), you could easily argue the dev experience is 3/3 for pgvector.
I made this table to compare vector databases in order to help me choose the best one for a new project. I spent quite a few hours on it, so I wanted to share it here too in hopes it might help others as well. My main criteria when choosing a vector DB were speed, scalability, DX, community and price. You'll find all of the comparison parameters in the article.
I'd love to know how vector databases compare in their ability to do hybrid queries, vector similarity filtered by metadata values. For example, find the 100 items with the closest cosine similarity where genre = jazz and publication date between 1990 and 2000.
Can the vector index operate on a subset of records? Or when searching for 100 closest matches does the database have to find 1000 matches and then apply the metadata filter, and hope that doesn't reduce the result set down to zero and exclude relevant vectors?
It seems like measuring precision and recall for hybrid queries would be illuminating.
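The failure mode behind that question can be sketched in plain numpy: post-filtering the top-k of an unfiltered search can come up short (or empty), while searching the filtered subset always returns the full k when enough matches exist. This is an illustrative toy, not any particular database's implementation:

```python
import numpy as np

def topk(scores, k):
    """Indices of the k highest scores, best first."""
    return np.argsort(-scores)[:k]

rng = np.random.default_rng(0)
vectors = rng.normal(size=(1000, 8)).astype(np.float32)
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)
genres = rng.choice(["jazz", "rock", "pop"], size=1000)
query = vectors[0]
scores = vectors @ query

# Post-filter: global top-100, then keep only jazz; may come up short
post = [i for i in topk(scores, 100) if genres[i] == "jazz"]

# Pre-filter: restrict to jazz first, then rank; returns the full 100
jazz_ids = np.where(genres == "jazz")[0]
pre = jazz_ids[topk(scores[jazz_ids], 100)]

print(len(post), len(pre))
```

With roughly a third of the items tagged jazz, the post-filtered list lands around 30 results instead of the 100 requested, which is exactly the "hope that doesn't reduce the result set down to zero" problem.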
I can't speak to the others, but pgvector indices can "break" hybrid queries. For example, if you select using a where clause specifying metadata (where genre = jazz) and order by distance from a vector (embedding of a sound clip), and the index doesn't have a lot of (or any) vectors in the sphere of the query vector that also match the metadata, it can return no results. I discuss this in a blog post here [1].
[1]: https://www.polyscale.ai/blog/pgvector-bigger-boat/
Curious about the lack of Vespa, especially given the thoroughness of the article and Vespa's long-time reputation. OpenSearch is also missing, but perhaps it can be lumped in with Elasticsearch since both are based on Lucene. The products are starting to diverge, though, so it would be nice to see, especially since it is open source.
For the performance-based columns, it would also be helpful to see which versions were tested. There is so much attention lately on vector databases that they are all making great strides forward. The Lucene updates are notable.
What advantage are vector databases providing above using an index in conjunction with a mature database? I’m not sold on this as a separate technology.
Vector search is useful, but I don’t understand why I would go out of my way when I could implement FAISS or HNSWlib as an adjunct to postgres or a document store.
Vector extensions to your current database or search engine make far more sense than adding yet another dependency to manage and operate. The vector database folks will have to become a real database or a full-featured search engine to survive and compete with the incumbents, which will all have good solutions for vector similarity search.
The thing is, if you need a vector _database_, there is no reason why it can't be a pg extension. And if your project is only small scale, there is probably some HNSW pg extension library you could use.
But what is most often needed instead of a vector database is an efficient, responsive approximate-KNN vector search system with fast attribute filtering, which overlaps with a fast and efficient text search system (e.g. BM25-based).
And if you then go to billion-vector scale, things become tricky performance-wise.
And then you reach the same point at which companies adopt a warehouse approach: a read-only, extremely read-optimized, mostly in-memory variant of their DB that is accessed for searches only, with changes from the main DB streamed to the read-only search instance, potentially at the cost of snapshot views, transactions and the like.
You could say that approximate-KNN vector search is the new must-have feature for unstructured fuzzy text search, and while you can have unstructured fuzzy text search in pg, pg is also often not the go-to solution if your database exists just for that search.
Strongly disagree with PGVector's DX being worse than Chroma. Installing, configuring, and working with Chroma was infuriating -- it's alpha software and has the bugs and rough edges to prove it. The tools to support and interface with postgres are battle-tested and so much nicer by comparison; getting Chroma working took over a week, ripping it out and replacing with PGVector took a couple hours.
Also agree with this[0] article that vector search is only one type of search, and even for RAG isn't necessarily the one you want to start with.
Yeah, I had a similar experience with Chroma DB. On paper, it checked all my boxes. But yea, it's alpha software with the first non-prerelease version only coming out in July 2023 (so it's 3 months old).
I ran into some dumb issues during install like the SQLite version being incorrect, and there wasn't much guidance on how to fix these problems, so gave up after struggling for a few hours. Switched to PGVector which was much simpler to setup. I hope Chroma DB improves, but I wouldn't recommend it for now.
Thanks for your input, I've only tried Chroma a little bit so far and had a pretty good experience. What they also have going for them is a big community on discord that can be helpful.
I quickly took a look at the RediSearch ANN-Benchmarks results, and it seems to stack up against the others (more or less the same level as Milvus) when it comes to QPS and latency.
I'm currently in the market for a self-hosted DB for a personal project. The project is an app you can run on your own system to provide QA over your text files. So I'm looking for something lightweight, but I'm also looking for the best possible search, and ANN retrieval is just a single part of that.
I think their definition of hybrid search is wrong.
Though these terms tend not to be consistently defined at all, so "wrong" is maybe the wrong word.
Their definition seems to be about filtering results during (approximate) KNN vector search.
But that is filtering, not hybrid search. It might sometimes be implemented as a form of hybrid search, but that's an internal implementation detail, and you should probably hope it's not implemented that way.
Hybrid search is when you do both a vector search and a more classical text based search (e.g. bm25) and combine both results in a reasonable way.
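One common way to do that "combine both results in a reasonable way" step is Reciprocal Rank Fusion (RRF), which needs only the two ranked lists, not comparable scores; a minimal sketch:

```python
def rrf(rankings, k=60):
    """Reciprocal Rank Fusion: merge several ranked doc-id lists into one.
    Each document scores sum(1 / (k + rank)) across the input lists."""
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_results = ["d1", "d2", "d3"]    # from the lexical (BM25) side
dense_results = ["d3", "d1", "d4"]   # from the vector side
print(rrf([bm25_results, dense_results]))
```

Documents ranked well by both retrievers float to the top, which is why RRF is a popular default for hybrid search despite its simplicity.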
The way you explain hybrid search aligns with my understanding. Pinecone has a good article about it here https://www.pinecone.io/learn/hybrid-search-intro/. From my understanding, all vector DBs support this.
This is interesting because it does not mention the vector database powered by Apache Cassandra, or its hosted serverless version, DataStax Astra. Here is a write-up we did on 5 hard problems in vector search and how we solved them: https://thenewstack.io/5-hard-problems-in-vector-search-and-...
In full transparency: I work for DataStax and lead engineering for the vector database.
I don't think we need specialized databases for vectors. Relational databases can easily be extended with vector data types and operations. They will eventually catch up by supporting what was once a unique feature of the new systems: https://medium.com/@magda7817/two-things-to-keep-in-mind-bef...
Yeah, this is my sense too. They will be slower to add these new requirements but they should be able to add these vector capabilities within a year or so. It's then a question of ability of smaller vector db companies to mature and add regular db capabilities, while innovating.
Agreed on pgvector being simple and a great choice for POCs and low scale, especially if you're familiar with Postgres. Our team released something new last week built for folks looking to use PostgreSQL at scale as a vector store [0], featuring a DiskANN index type.
Quick question regarding the scalability and support of multiple vector databases under a single cloud service. Suppose an enterprise Saas product served multiple customers with each requiring a unique RAG vector knowledge-base for product and company info. Do any of these solutions allow for a large number (dozens or hundreds) of small distinct Knowledge bases? Do any offer easily integrated automated pipelines for documents to be parsed and ingested?
Postgres with PGVector is the best database, plus vectors.
All of the "Vector DBs" suffer horribly when trying to do basic things.
Want to include any field that matches a field in an array of keys? Easy in SQL. Requires an entire song and dance in Pinecone or Weaviate.
After implementing Chroma, Weaviate, Pinecone, Sqlite with HNSW indices and Qdrant-- I'm not impressed. Postgres is measurably faster since so much relies on pre-filtering, joins, etc.
Strongly disagree about the Pinecone developer experience. Not that they don't have SDKs, but last I checked they didn't have documentation on how to approach local dev environments.
The implication being that you spin up a separate index for $70/mo, and then you have to upsert any relevant data yourself. Sure that's not difficult, but why do you have to do it at all? Why doesn't Pinecone make it easy to replicate data to another index for use in dev/staging?
You might like the 'Which Search Engine?' panel I ran at Buzzwords earlier this year with some of the leading contenders (Vespa, Qdrant, Elastic, Solr, Weaviate) https://www.youtube.com/watch?v=iI40L4wMtyI - vector search was obviously part of the discussion
20M vectors @ 768 dimensions is about 62GB as 32-bit floats, not even quantized. AWS RDS will put it at $83/mo (db.t4g.small, 2 vCPU, 2GB RAM). But that's not counting egress, backups, etc.
Seems acceptable at least for a POC?
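The sizing arithmetic above checks out (ignoring index overhead and any quantization):

```python
n_vectors = 20_000_000
dims = 768
bytes_per_float32 = 4  # unquantized 32-bit floats

total_bytes = n_vectors * dims * bytes_per_float32
print(f"{total_bytes / 1e9:.1f} GB")  # 61.4 decimal GB, i.e. "about 62GB"
```

Note that an HNSW or IVF index adds memory on top of the raw vectors, so real deployments need headroom beyond this figure.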
A better option if you already have the data in the same instance, but developer experience being low scares me. Anyone tried it? How did it go?
I'm interested to try some of these others next time around, but I've used qdrant self-hosted in two projects and been pleased. Milvus was recommended so I gave that a try but found it over complicated. Pgvector seems like an obvious choice if you are already using postgres and if that performance is ok.
Latency from embedding models is still going to be the bottleneck for performance however fast the DB is going to be. Plus adding all the overhead of synthesising answers and summaries from a LLM is going to weigh you down.
We conducted benchmark tests on Elastic's queries-per-second (QPS) performance using datasets of 500,000 and 1 million vectors. The result: Zilliz was 13x and 22x faster for the two dataset sizes, respectively. https://zilliz.com/blog/elasticsearch-cloud-vs-zilliz
We also conducted a benchmark comparing Pgvector to both Milvus (open source) and Zilliz (managed, with a free tier option). When running the OSS Milvus on 2 CPUs and 8 GiB memory, Pgvector was found to be 5 times slower. You can check out the detailed performance charts at the bottom of this blog post:
https://zilliz.com/blog/getting-started-pgvector-guide-devel...
Feel free to explore our open-source benchmarking tool, which allows you to examine our methodology and even compare it with your vector database. https://github.com/zilliztech/VectorDBBench
Yeah, that's the difference we've seen according to the QPS for the ANN Benchmarks. The same story seems to be true for other datasets too. We're looking at a 0.9 recall.
Many of them are open source and you can host them yourself. That would make it more cost effective. Also someone mentioned https://turbopuffer.com/. That seems like a good alternative if you're looking for something economical.
Somehow I felt that at least part of the article was generated by an LLM. It's unfortunate to see a new bias start to creep in: whatever I read now, I second-guess whether it may be partially or fully generated by an LLM.
We recently did a bunch of evaluation work to quantify the differences between keyword search, vector search, hybrid, reranking, etc. across a few datasets. We shared the results here: https://techcommunity.microsoft.com/t5/azure-ai-services-blo...
Disclosure - I work in the Azure Search team.
unknown|2 years ago
[deleted]
frogperson|2 years ago
unknown|2 years ago
[deleted]
noonething|2 years ago
emilfroberg|2 years ago
BeetleB|2 years ago
I'm building a RAG for my personal use: Say I have a lot of notes on various topics I've compiled over the years. They're scattered over a lot of text files (and org nodes). I want to be able to ask questions in a natural language and have the system query my notes and give me an answer.
The approach I'm going for is to store those notes in a vector DB. When I ask my query, a search is performed and, say, the top 5 vectors are sent to GPT for parsing (along with my query). GPT will then come back with an answer.
I can build something like this, but I'm struggling in figuring out metrics for how good my system is. There are many variables (e.g. amount of content in a given vector, amount of overlap amongst vectors, number of vectors to send to GPT, and many more). I'd like to tweak them, but I also want some objective way to compare different setups. Right now all I do is ask a question, look at the answer, and try to subjectively gauge whether I think it did a good job.
Any tips on how people measure the performance/effectiveness for these types of problems?
TrueDuality|2 years ago
What I can recommend is to take the coffee tasting approach. Don't try and test and evaluate individual responses, instead lock the seed used in generation, and use the same prompt for two different runs. Change one variable and do a relative comparison of the two outputs. The variables probably worth testing for you off the top of my head:
* Choice of models and/or tunes
* System prompts
* Temperature of the model against your queries
* Threshold for similarity for document inclusions (you only want relevant documents from your RAG, set it too low and you'll get some extra distractions, too high and useful information might be left out of the context).
If you setup a system to track the comparisons either automatically or by hand that just indicates which side of the change worked better for your use case, and test that same change for a bunch of different prompts you should be able to tally up whether the control or change was more preferred.
Keep those data points! The data points are your bench log and can be invaluable later on for anything you do with the system to see what changed in aggregate, what had the most outsized impact, etc and can guide you to build useful tooling for testing or finding existing solutions out there.
civilitty|2 years ago
This requires a lot of domain specific work. For example, two of my test cases are “Is it [il]legal to build an atomic bomb” run against the entire USCode [1] so I have a list of sections that are relevant to the question that I’ve scored before eventually getting an answer of “it is illegal” followdd by several prompts that evaluate nuance in the answer (“it’s illegal except for…”). I have hundreds of these test cases, approaching a thousand. It’s a slog.
[1] 42 U.S.C. 2122 is one of the “right” sections in case anyone is wondering. Another step tests whether 2121 is pulled in based on the mention in 2122
screye|2 years ago
Blog on the same topic - https://blog.langchain.dev/evaluating-rag-pipelines-with-rag...
hobs|2 years ago
The main thing is that there's no "objective" way, but if you rank and label your own data then you can certainly get a ranking that's subjectively well performing according to you.
PheonixPharts|2 years ago
You'll need to build a data set with known correct answers but then it's basically, NDCG (Normalized Discounted Cumulative Gain) is a good place to start, MRR (Mean Reciprocal Rank) and MAP (Mean Absolute Precision) are other options. You could also just look at the accuracy of getting your result in the top K results for various thresholds for k (which can be interpreted as the "probability of getting your result in 'k' results).
jlopes2|2 years ago
Included here is a bit of the old tried and true: NDCG/MRR/Precision @k - what you really want for measuring your information retrieval systems.
But we also talk through a bit of the "new", how to use Evals to generate the building blocks for those metrics above. You will want both hand labels and the automated Evals in the end to evaluate your system.
Extasia785|2 years ago
dmezzetti|2 years ago
txtai is an all-in-one embeddings database for semantic search, LLM orchestration and language model workflows.
Embeddings databases are a union of vector indexes (sparse and dense), graph networks and relational databases. This enables vector search with SQL, topic modeling and retrieval augmented generation.
txtai adopts a local-first approach. A production-ready instance can be run locally within a single Python instance. It can also scale out when needed.
txtai can use Faiss, Hnswlib or Annoy as it's vector index backend. This is relevant in terms of the ANN-Benchmarks scores.
Disclaimer: I am the author of txtai
Der_Einzige|2 years ago
emilfroberg|2 years ago
drewbug01|2 years ago
For example, pgvector is listed as not having role-based access control, but the Postgres manual dedicates an entire chapter to it: https://www.postgresql.org/docs/current/user-manag.html
Hence why I’d be interested to know more about the supporting details for the different categories. It may help uncover some inadvertent errors in the analysis, but also would just serve as a useful jumping-off point for people doing their own research as well.
proleisuretour|2 years ago
C++ pgvector-cpp C# pgvector-dotnet Crystal pgvector-crystal Dart pgvector-dart Elixir pgvector-elixir Go pgvector-go Haskell pgvector-haskell Java, Scala pgvector-java Julia pgvector-julia Lua pgvector-lua Node.js pgvector-node Perl pgvector-perl PHP pgvector-php Python pgvector-python R pgvector-r Ruby pgvector-ruby, Neighbor Rust pgvector-rust Swift pgvector-swift
Wonder how many of those other Vector databases play nice.
sojournerc|2 years ago
I think there's an unmentioned benefit to using something like pgvector also. You don't need a separate relational database! In fact you can have foreign keys to your vectors/embeddings which is super powerful to me.
mritchie712|2 years ago
hereonout2|2 years ago
emilfroberg|2 years ago
andre-z|2 years ago
panarky|2 years ago
Can the vector index operate on a subset of records? Or when searching for 100 closest matches does the database have to find 1000 matches and then apply the metadata filter, and hope that doesn't reduce the result set down to zero and exclude relevant vectors?
It seems like measuring precision and recall for hybrid queries would be illuminating.
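A toy simulation of that over-fetch failure mode, in pure Python with random vectors and a hypothetical 5% filter selectivity (all numbers here are made up for illustration):

```python
import random
import math

random.seed(0)

DIM, N = 8, 1000

def rand_vec():
    return [random.gauss(0, 1) for _ in range(DIM)]

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Each record has a vector and a metadata flag; only ~5% satisfy the filter.
records = [{"vec": rand_vec(), "match": random.random() < 0.05} for _ in range(N)]
query = rand_vec()

# Ground truth: the 10 nearest neighbors among records that satisfy the filter.
eligible = [r for r in records if r["match"]]
truth = set(id(r) for r in sorted(eligible, key=lambda r: dist(query, r["vec"]))[:10])

# Post-filtering: fetch the top 100 overall, then drop non-matching rows.
top100 = sorted(records, key=lambda r: dist(query, r["vec"]))[:100]
post = [r for r in top100 if r["match"]]

recall = len(set(id(r) for r in post) & truth) / len(truth)
print(f"post-filter kept {len(post)} of 10 wanted; recall={recall:.2f}")
```

With a selective filter, the over-fetched candidate list contains only a handful of eligible rows, which is exactly why pre-filtering (or filter-aware ANN traversal) matters.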
mvcalder|2 years ago
[1]: https://www.polyscale.ai/blog/pgvector-bigger-boat/
mistrial9|2 years ago
"no" - the graph objects after training are opaque AFAIK
donretag|2 years ago
For the performance-based columns, it would also be helpful to see which versions were tested. There is so much attention on vector databases lately that they are all making great strides forward. The Lucene updates are notable.
deepsquirrelnet|2 years ago
Vector search is useful, but I don’t understand why I would go out of my way when I could implement FAISS or HNSWlib as an adjunct to postgres or a document store.
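For what it's worth, that adjunct pattern is only a few lines. A sketch with SQLite standing in for the document store and a brute-force cosine scan standing in for a FAISS/HNSWlib index (the table, documents, and two-dimensional "embeddings" are all invented):

```python
import sqlite3
import math

# Document store: SQLite holds the text; the vector index lives alongside it.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, body TEXT)")
docs = {1: "reset your password", 2: "invoice overdue", 3: "password login failed"}
db.executemany("INSERT INTO docs VALUES (?, ?)", docs.items())

# Toy embeddings keyed by row id; in practice these come from a model and
# would be indexed with FAISS or hnswlib instead of this brute-force scan.
vectors = {1: [0.9, 0.1], 2: [0.1, 0.9], 3: [0.8, 0.3]}

def cosine(a, b):
    num = sum(x * y for x, y in zip(a, b))
    return num / (math.dist(a, [0] * len(a)) * math.dist(b, [0] * len(b)))

def search(qvec, k=2):
    # ANN step: find the k nearest ids, then fetch payloads from the store
    # (note: SQL IN does not preserve the similarity ranking).
    ids = sorted(vectors, key=lambda i: -cosine(qvec, vectors[i]))[:k]
    return db.execute(
        f"SELECT id, body FROM docs WHERE id IN ({','.join('?' * len(ids))})", ids
    ).fetchall()

print(search([1.0, 0.0]))  # docs 1 and 3 are the nearest
```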
dathinab|2 years ago
But what is most often needed instead of a vector database is an efficient, responsive approximate-KNN vector search system with fast attribute filtering, which overlaps with a fast, efficient text search system (e.g. BM25-based).
And if you then go to billion-vector scale, things become tricky performance-wise.
And then you reach the same point at which companies adopt a warehouse approach: a read-only, extremely read-optimized, mostly in-memory variant of their DB that is accessed for searches only, with changes from the main DB streamed to the read-only search instance, potentially losing snapshot views, transactions and the like along the way.
You could say approximate-KNN vector search is the new must-have feature for unstructured fuzzy text search, and while you can have unstructured fuzzy text search in pg, it's also often not the go-to solution if your database exists just to serve that search.
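For reference, the BM25 scoring mentioned here fits in a few lines (Okapi BM25 with the usual k1/b defaults; the documents are made up):

```python
import math
from collections import Counter

# Minimal Okapi BM25 scorer; k1 and b are the conventional defaults.
K1, B = 1.5, 0.75

docs = [
    "user cannot sign on to the portal".split(),
    "billing invoice was duplicated".split(),
    "sign on fails after password reset".split(),
]
N = len(docs)
avgdl = sum(len(d) for d in docs) / N
df = Counter(t for d in docs for t in set(d))  # document frequency per term

def idf(term):
    n = df[term]
    return math.log(1 + (N - n + 0.5) / (n + 0.5))

def bm25(query, doc):
    tf = Counter(doc)
    score = 0.0
    for t in query.split():
        f = tf[t]
        score += idf(t) * f * (K1 + 1) / (f + K1 * (1 - B + B * len(doc) / avgdl))
    return score

ranked = sorted(range(N), key=lambda i: -bm25("sign on", docs[i]))
print(ranked)  # documents mentioning "sign on" rank first
```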
dmezzetti|2 years ago
1. https://neuml.github.io/txtai/embeddings/indexing/
2. https://neuml.hashnode.dev/external-database-integration
citruscomputing|2 years ago
Also agree with this[0] article that vector search is only one type of search, and even for RAG isn't necessarily the one you want to start with.
[0]: https://colinharman.substack.com/p/beware-tunnel-vision-in-a...
luckyt|2 years ago
I ran into some dumb issues during install, like the SQLite version being incorrect, and there wasn't much guidance on how to fix these problems, so I gave up after struggling for a few hours. I switched to pgvector, which was much simpler to set up. I hope Chroma DB improves, but I wouldn't recommend it for now.
Pandabob|2 years ago
[0]: https://twitter.com/sh_reya/status/1661136833848438784
esafak|2 years ago
Euclidean distance, inner product, and cosine similarity are supported.
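Those three metrics in plain Python, for anyone comparing them side by side:

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def inner_product(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine_sim(a, b):
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return inner_product(a, b) / (na * nb)

a, b = [3.0, 4.0], [4.0, 3.0]
print(euclidean(a, b), inner_product(a, b), cosine_sim(a, b))
# On unit-normalized vectors, ranking by inner product, cosine similarity,
# and (negated) Euclidean distance all agree.
```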
J_Shelby_J|2 years ago
There is also this series of articles detailing the options and it includes some that the OP is missing: https://thedataquarry.com/posts/vector-db-1/#key-takeaways
I'm currently in the market for a self-hosted DB for a personal project: an app you can run on your own system to do Q&A over your text files. So I'm looking for something lightweight, but I'm also looking for the best possible search, and ANN retrieval is just a single part of that.
dathinab|2 years ago
Though these terms tend not to be consistently defined at all, so "wrong" is maybe the wrong word.
Their definition seems to be about filtering results during (approximate) KNN vector search.
But that is filtering, not hybrid search. Though it might sometimes be implemented as a form of hybrid search, that's an internal implementation detail, and you should probably hope it's not implemented that way.
Hybrid search is when you do both a vector search and a more classical text-based search (e.g. BM25) and combine the results in a reasonable way.
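A common way to do that combining step is reciprocal rank fusion; a minimal sketch (the doc IDs and ranked lists are invented):

```python
# Reciprocal rank fusion (RRF): merge a vector result list with a BM25
# result list by summing 1/(k + rank); k=60 is the customary constant.
def rrf(result_lists, k=60):
    scores = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["d3", "d1", "d7"]  # from ANN search
bm25_hits = ["d1", "d9", "d3"]    # from keyword search
print(rrf([vector_hits, bm25_hits]))  # "d1" wins: near the top of both lists
```

RRF is popular because it needs no score calibration between the two systems, only ranks.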
prabhatjha|2 years ago
In full transparency: I work for DataStax and lead engineering for its vector database.
avthar|2 years ago
[0]: https://www.timescale.com/blog/how-we-made-postgresql-the-be...
__newmoon__|2 years ago
All of the "Vector DBs" suffer horribly when trying to do basic things.
Want to include any field that matches a field in an array of keys? Easy in SQL. Requires an entire song and dance in Pinecone or Weaviate.
After implementing Chroma, Weaviate, Pinecone, Sqlite with HNSW indices and Qdrant-- I'm not impressed. Postgres is measurably faster since so much relies on pre-filtering, joins, etc.
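For comparison, the pre-filtering that's easy in SQL looks roughly like this (SQLite with toy two-dimensional embeddings stored as text; every name here is made up):

```python
import sqlite3
import math

# Pre-filtering in plain SQL: restrict by keys first, then rank the
# surviving rows by vector distance.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, tenant TEXT, emb TEXT)")
db.executemany("INSERT INTO items VALUES (?, ?, ?)", [
    (1, "acme", "1.0,0.0"),
    (2, "acme", "0.0,1.0"),
    (3, "other", "1.0,0.1"),
])

def l2(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def search(query_vec, tenants, k=1):
    placeholders = ",".join("?" * len(tenants))
    candidates = db.execute(
        f"SELECT id, emb FROM items WHERE tenant IN ({placeholders})", tenants
    ).fetchall()
    scored = [(l2(query_vec, [float(x) for x in emb.split(",")]), i)
              for i, emb in candidates]
    return [i for _, i in sorted(scored)[:k]]

print(search([0.9, 0.1], ["acme"]))  # [1]: id 3 is nearer but excluded by the filter
```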
iansinnott|2 years ago
The implication being that you spin up a separate index for $70/mo, and then you have to upsert any relevant data yourself. Sure that's not difficult, but why do you have to do it at all? Why doesn't Pinecone make it easy to replicate data to another index for use in dev/staging?
BenoitP|2 years ago
20M vectors @ 768 dims is about 62 GB at 32-bit floats, not even quantized. AWS RDS will put it at $83/mo (db.t4g.small, 2 vCPU, 2 GB RAM). But that's not with egress, backups, etc.
Seems acceptable at least for a POC?
A better option if you already have the data in the same instance, but the rough developer experience scares me. Anyone tried it? How did it go?
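The arithmetic behind that estimate, for anyone checking:

```python
# Back-of-envelope memory for 20M float32 vectors at dimension 768.
n, dim, bytes_per_float = 20_000_000, 768, 4
raw_bytes = n * dim * bytes_per_float
print(raw_bytes / 1e9, "GB")  # 61.44 GB, before index overhead or quantization
```

An HNSW index adds graph-link overhead on top of this raw figure, and the working set ideally fits in RAM.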
LunaSea|2 years ago
Vector indexes are very large, almost the size of the original data, and they ideally need to fit into the database's memory.
unknown|2 years ago
[deleted]
Havoc|2 years ago
I thought for most use cases this would be quite performance sensitive
bayesian_limit|2 years ago
We also conducted a benchmark comparing Pgvector to both Milvus (open source) and Zilliz (managed, with a free tier option). When running the OSS Milvus on 2 CPUs and 8 GiB memory, Pgvector was found to be 5 times slower. You can check out the detailed performance charts at the bottom of this blog post: https://zilliz.com/blog/getting-started-pgvector-guide-devel...
Feel free to explore our open-source benchmarking tool, which allows you to examine our methodology and even compare it with your vector database. https://github.com/zilliztech/VectorDBBench