This is actually really cool, and despite what I'm sure will come off as (constructive) criticism, I am very impressed!
First, I think you oversell the overhead of keeping data in sync and the costs of not doing so in a timely manner. Almost any distributed system that uses multiple databases already needs a strategy for dealing with inconsistent data. As far as this problem goes, inconsistent embeddings are a pretty minor issue given that (1) most embedding-based workflows don't do a lot of updating/deletion; and (2) the sheer volume of embeddings from even a small corpus of data means that in practice you're unlikely to notice consistency issues. In most cases you can get away with doing much less than is described in this post. That being said, I want to emphasize that I still think not having to worry about syncing data is indeed cool.
Second, IME the most significant drawback to putting your embeddings in a Postgres database with all your other data is that the workload looks so different. To take one example, HNSW indices using pgvector consume a ton of resources - even a small index of tens of millions of embeddings may be hundreds of gigabytes on disk and requires very aggressive vacuuming to perform optimally. It's very easy to run into resource contention issues when you effectively have an index that will consume all the available system resources. The canonical solution is to move your data into another database, but then you've recreated the consistency problem that your solution purports to solve.
Third, a question: how does this interact with filtering? Can you take advantage of partial indices on the underlying data? Are some of the limitations in pgvector's HNSW implementation (as far as filtering goes) still present?
Post co-author here. Really appreciate the feedback.
Your point about HNSW being resource intensive is one we've heard. Our team actually built another extension called pgvectorscale [1], which helps scale vector search on Postgres with a new index type (StreamingDiskANN). It has BQ (binary quantization) out of the box and can also store vectors on disk rather than only in memory.
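For readers unfamiliar with BQ: binary quantization compresses each float dimension of a vector down to a single sign bit, so candidate comparisons become cheap Hamming distances. A toy sketch of the idea only, not pgvectorscale's actual implementation:

```python
def binary_quantize(vec):
    # Keep only the sign of each dimension: 1 bit instead of a 32-bit float
    return [1 if x > 0 else 0 for x in vec]

def hamming_distance(a, b):
    # Number of differing bits; a cheap proxy for vector dissimilarity
    return sum(x != y for x, y in zip(a, b))

a = binary_quantize([0.12, -0.80, 0.33, -0.05])   # -> [1, 0, 1, 0]
b = binary_quantize([0.10, -0.75, -0.40, -0.02])  # -> [1, 0, 0, 0]
print(hamming_distance(a, b))  # 1: only the third sign differs
```

In practice BQ is typically used to prune candidates cheaply, with full-precision vectors re-scoring the survivors.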
Another practice I've seen work well is for teams to use a read replica to serve application queries and reduce load on the primary database.
To answer your third question, if you combine Pgai Vectorizer with pgvectorscale, the limitations around filtered search in pgvector HNSW are actually no longer present. Pgvectorscale implements streaming filtering, ensuring more accurate filtered search with Postgres. See [2] for details.
Elasticsearch has recently added a data type called semantic_text, which automatically chunks text, calculates embeddings, and stores the chunks with sensible defaults.
Queries are similarly simplified: vectors are calculated and compared internally, which means a lot less I/O and much simpler client code.
Hey HN! Post co-author here, excited to share our new open-source PostgreSQL tool that re-imagines vector embeddings as database indexes. It's not literally an index but it functions like one to update embeddings as source data gets added, deleted or changed.
Right now the system only supports OpenAI as an embedding provider, but we plan to extend with local and OSS model support soon.
Eager to hear your feedback and reactions. If you'd like to leave an issue or better yet a PR, you can do so here [1]
Hey, this is really cool! Thanks for the article and the tool itself.
One question - in the RAG projects we've done, most of the source data was scattered in various source systems, but wasn't necessarily imported into a single DB or Data Lake. For example, building an internal Q&A tool for a company that has knowledge stored in services like Zendesk, Google Drive, an internal company Wiki, etc.
In those cases, it made sense to not import the source documents, or only import metadata about them, and keep the embeddings in a dedicated Vector DB. This seems to me to be a fairly common use case - most enterprises have this kind of data scattered across various systems.
How do you envision this kind of use case working with this tool? I may have missed it, but you mention things like working with images, etc, is your assumption that everyone is storing all of that data in Postgres?
Pretty smart. Why is the DB api the abstraction layer though? Why not two columns and a microservice. I assume you are making async calls to get the embeddings?
I say that because it seems unusual. An index would suit sync better. But async things like embeddings, geo for an address, or whether an email is considered spam feel like app-level stuff.
This is super cool! One suggestion for the blog: I would put "re-imagines vector embeddings as database indexes. It's not literally an index but it functions like one to update embeddings as source data gets added, deleted or changed." as a tl/dr at the top.
It wasn't clear to me why this was significantly different from using pgvector until I read that. That makes the rest of the post (e.g. why you need the custom methods in a `SELECT`) make a lot more sense in context.
A method that has worked well for me: divorced databases.
The first database is a plaintext database that stores rows of id, data, and metadata; the second is a vector database that stores id and embedding. Whenever a new row is added, the first database makes a POST request to the second database. The second database embeds the data and returns the ID of its row. The first database uses that ID to store the plain text.
When searching, the second database is optimized for cosine similarity with an HNSW index. It returns the IDs to the first database, which fetches the plaintext to return to the user.
The advantages of this are that the plaintext data can be A/B tested across multiple embedding models without affecting the source, and each database can be provisioned for a specific task. It also lowers hosting costs and improves security, because there only needs to be one central vector database alongside small provisioned plaintext databases.
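A minimal sketch of this two-database split, with in-process classes standing in for the two services and a deterministic toy embedding standing in for the model (all names here are hypothetical):

```python
import hashlib

def toy_embed(text, dim=8):
    # Stand-in for a real embedding model: a deterministic pseudo-vector
    digest = hashlib.sha256(text.encode()).digest()
    return [b / 255 for b in digest[:dim]]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = lambda v: sum(x * x for x in v) ** 0.5
    return dot / (norm(a) * norm(b))

class VectorDB:
    # Second database: stores (id, embedding); behind a POST endpoint in reality
    def __init__(self):
        self.rows, self.next_id = {}, 0

    def ingest(self, text):
        self.next_id += 1
        self.rows[self.next_id] = toy_embed(text)
        return self.next_id  # the id the plaintext DB stores the row under

    def search(self, query, k):
        q = toy_embed(query)
        ranked = sorted(self.rows, key=lambda i: -cosine(q, self.rows[i]))
        return ranked[:k]

class PlaintextDB:
    # First database: stores (id, data); the id comes back from the vector DB
    def __init__(self, vector_db):
        self.vector_db, self.rows = vector_db, {}

    def add(self, text):
        row_id = self.vector_db.ingest(text)  # the POST request in the real setup
        self.rows[row_id] = text
        return row_id

    def search(self, query, k=1):
        return [self.rows[i] for i in self.vector_db.search(query, k)]

db = PlaintextDB(VectorDB())
db.add("cats are great pets")
db.add("quarterly stock prices")
print(db.search("cats are great pets"))  # the exact text embeds identically
```

The interesting property is that the plaintext side never touches embeddings at all, which is what makes swapping embedding models an isolated change.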
Post co-author here. This is actually something we are considering implementing in future versions of pgai Vectorizer: you point the vectorizer at database A but tell it to create and store embeddings in database B. You can always do joins across the two databases with Postgres FDWs, and it would solve issues of load management if those are concerns. Neat idea, and one on our radar!
It's like SQLite for vector embeddings, and you can store metadata (the primary data, foreign keys, etc.) along with the vectors, preserving the relationship.
Not sure if the metadata is indexed, but IIRC it's more or less trivial to update the embeddings when your data changes (though I haven't used it in a while, so not sure).
Good q. For most standalone vector search use cases, FAISS or a library like it is good.
However, FAISS is not a database. It can store metadata alongside vectors, but it doesn't have things you'd want in your app db like ACID compliance, non-vector indexing, and proper backup/recovery mechanisms. You're basically giving up all the DBMS capabilities.
For new RAG and search apps, many teams prefer just using a single app db with vector search capabilities included (Postgres, Mongo, MySQL etc) vs managing an app db and a separate vector db.
I've been in the vector database space for a while (primary author of txtai). I do think vector indexing in traditional databases with tools like pgvector is a good option.
txtai has long had SQLite + Faiss support to enable metadata filtering with vector search. That pattern can take you farther than you think.
The design decision I've made is to make it easy to plug in different backends for metadata and vectors. For example, txtai supports storing both in Postgres (w/ pgvector). It also supports sqlite-vec and DuckDB.
I'm not sure there is a one-size-fits-all approach. Flexibility and options seem like a win to me. Different situations warrant different solutions.
Wow, actually a good point I haven't seen anyone make.
Taking raw embeddings and storing them in a vector database would be like taking raw n-grams of your text and putting them into a database for search.
Been using pgvector for a while, and to me it was kind of obvious that the source document and the embeddings are fundamentally linked, so we always stored them "together". Basically anyone doing embeddings at scale is doing something similar to what pgai Vectorizer is doing, so it's certainly a nice abstraction.
This is how most modern vector dbs work, you usually can store much more than just the raw embeddings (full text, metadata fields, secondary/named vectors, geospatial data, relational fields, etc).
I agree that putting the vectors in a separate DB often does not make sense. Just use Hana https://news.sap.com/2024/04/sap-hana-cloud-vector-engine-ai... ;-) IMHO, putting the calculation of the embedding vectors into the DB (even if it is just a remote call) is not a good idea. How do you react to failures of the remote call, or to security issues because of code running within your DB?
At my current company, we used Postgres with pgvector so the text is co-located with the embeddings on the same rows. At first, I was a bit apprehensive about the idea of getting so close to the nitty-gritty technical details of computing vector embeddings and doing cosine similarity matching but actually it has been wonderful. There is something magical about working directly with embeddings. Computing, serializing and storing everything yourself is actually surprisingly simple. Don't let the magic scare you.
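For anyone hesitant: the "compute, serialize, store" loop really is small. A sketch in Python (the vector values are made up; pgvector accepts vectors as `'[x1,x2,...]'` text literals):

```python
def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = lambda v: sum(x * x for x in v) ** 0.5
    return dot / (norm(a) * norm(b))

def to_pgvector_literal(vec):
    # pgvector's text input format: '[0.25,-0.5,...]'
    return "[" + ",".join(repr(x) for x in vec) + "]"

query_vec = [1.0, 0.0]
doc_vec = [1.0, 1.0]
print(round(cosine_similarity(query_vec, doc_vec), 4))  # 0.7071
print(to_pgvector_literal([0.25, -0.5]))                # [0.25,-0.5]
```

In production you'd let the database compute the distance (pgvector's operators) rather than pulling vectors client-side, but it demystifies what the index is doing.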
Recently I've been doing hardcore stuff like taking an old hierarchical clustering library and substituting the vector distance functions with a cosine similarity function so that it groups/clusters records based on similarity of their embeddings. It's funny reading the README of that 10 year old library and they're showing how to use it to do tedious stuff like grouping together 3-dimensional color vectors. I'm using it to cluster together content based on meaning similarity using vectors of over 1.5k dimensions. Somehow, I don't think the library authors saw that coming.
How great is it to come across a library which hasn't been updated in 10 years and yet is flexible and simple enough that it can be re-purposed to serve a radically more advanced use case which would have been beyond the author's imagination at the time...
I think the most surprising aspect about the whole experience is that working with the embeddings directly makes it feel like your database is intelligent; but you know it's just a plain old dumb database and all the embeddings were pre-computed.
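The same substitution can be sketched from scratch: greedy agglomerative clustering is distance-function-agnostic, so swapping in cosine distance is a one-line change. This is not the parent's actual library, and the toy 2-d vectors below stand in for 1.5k-dimension embeddings:

```python
def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = lambda v: sum(x * x for x in v) ** 0.5
    return 1 - dot / (norm(a) * norm(b))

def agglomerate(vectors, threshold, dist=cosine_distance):
    # Single-linkage agglomerative clustering over any distance function
    clusters = [[i] for i in range(len(vectors))]
    while True:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(dist(vectors[a], vectors[b])
                        for a in clusters[i] for b in clusters[j])
                if d < threshold and (best is None or d < best[0]):
                    best = (d, i, j)
        if best is None:
            return clusters
        _, i, j = best
        clusters[i] += clusters[j]
        del clusters[j]

# Two directions -> two "meaning" clusters
vecs = [[1, 0.1], [1, 0.12], [0.1, 1], [0.08, 1]]
print(agglomerate(vecs, threshold=0.1))  # groups indices {0, 1} and {2, 3}
```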
In the project I'm currently working on, I use OpenSearch for RAG because it allows me to use hybrid search, which combines full-text search with vector search, and OpenSearch does all the math of combining the two result sets for me. Research shows that hybrid search can give better results than vector search alone. Another team was already integrating OpenSearch for full-text search for a different feature, so I just reused existing infra, sparing the time of DevOps/SRE.
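OpenSearch does the normalization and combination server-side; the underlying idea can be approximated client-side with reciprocal rank fusion, one common way to merge a lexical and a semantic result list (the doc ids below are hypothetical):

```python
def rrf_fuse(result_lists, k=60):
    # Reciprocal Rank Fusion: score(doc) = sum over lists of 1 / (k + rank)
    scores = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits   = ["doc_a", "doc_b", "doc_c"]  # full-text ranking
vector_hits = ["doc_b", "doc_d", "doc_a"]  # semantic ranking
print(rrf_fuse([bm25_hits, vector_hits]))  # ['doc_b', 'doc_a', 'doc_d', 'doc_c']
```

Rank-based fusion like this sidesteps the fact that BM25 scores and cosine similarities live on incomparable scales.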
I feel like most of the points raised in the article are solved by “use pgvector”, and then I’m very skeptical of handing over responsibility for API calls for creating the embeddings to the DB itself? I already have a software layer that knows how to do things like logs, API call failures? Having the DB handle fetching data from external sources feels like the wrong abstraction to me.
I agree. However, I think what they are saying is that the embedding should just be like any other index. And yeah, it should be, but that isn't reality: there are massive latencies involved, as well as costs.
Perhaps in ~10 years embedding / chunking approaches will be so mature that there will just be one way to do it and will take no more time than updating a btree but that certainly isn't the case now.
I think the right abstraction for today would be for OpenAI to manage the vector search. It is kind of weird to send all of the data to a service only to have it compute a vector and hand it back to me. I have to figure out how to chunk it etc (I'm sure they would do a better job than I would). I should just have to deal with text ideally. Someone else can figure out how to return the best results.
what's your experience with sqlite-vec? I'm considering using sqlite-vec in addition to/or replace qdrant vector db for a project (recurse.chat), since I'm moving all the data to sqlite. I love everything SQLite so far, but haven't got to try out sqlite-vec yet.
Hey, this looks great! I'm a huge fan of vectors in Postgres or wherever your data lives, and this seems like a great abstraction.
When I write a sql query that includes a vector search and some piece of logic, like:
```
select name from users where age > 21 order by <vector_similarity(users.bio, "I like long walks on the beach")> limit 10;
```
Does it filter by age first or second? I've liked the DX of pg_vector, but they do vector search, followed by filtering. It seems like that slows down what should be the superpower of a setup like this.
It could do either, depending on what the planner decides. In pgvector it usually does post-filtering in practice (filter after vector search).
pgvector HNSW has the problem that there is a cutoff of retrieving some constant C results, and if none of them match the filter then it won't find results. I believe newer versions of pgvector address that. Also, pgvectorscale's StreamingDiskANN [1] doesn't have that problem to begin with.
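That failure mode is easy to reproduce with a brute-force stand-in for the index (exact search instead of HNSW, made-up data, but the same pre- vs. post-filtering tradeoff):

```python
def knn(vecs, query, c):
    # ids of the c nearest vectors by squared L2 distance
    d = lambda v: sum((a - b) ** 2 for a, b in zip(v, query))
    return sorted(vecs, key=lambda i: d(vecs[i]))[:c]

# The 10 points nearest the query are "news"; the only "blog" point is far away
vectors = {f"news{i}": [float(i)] for i in range(10)}
vectors["blog0"] = [50.0]
category = {k: ("blog" if k.startswith("blog") else "news") for k in vectors}

def post_filter(query, want, c):
    # vector search first, filter second: can come back empty
    return [i for i in knn(vectors, query, c) if category[i] == want]

def pre_filter(query, want, c):
    # restrict candidates first, then search
    allowed = {k: v for k, v in vectors.items() if category[k] == want}
    return knn(allowed, query, c)

print(post_filter([0.0], "blog", c=5))  # [] -- all 5 retrieved hits were "news"
print(pre_filter([0.0], "blog", c=5))   # ['blog0']
```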
Similar to the blog post, but at the ORM layer instead of the extension layer: I built a PostgreSQL ORM for Node.js based on ActiveRecord + Django's ORM that includes the concept of vector fields [0][1], which lets you write code like this:
```
// Stores the `title` and `content` fields together as a vector
// in the `content_embedding` vector field
BlogPost.vectorizes(
  'content_embedding',
  (title, content) => `Title: ${title}\n\nBody: ${content}`
);

// Find the top 10 blog posts matching "blog posts about dogs"
// Automatically converts query to a vector
let searchBlogPosts = await BlogPost.query()
  .search('content_embedding', 'blog posts about dogs')
  .limit(10)
  .select();
```
I find it tremendously useful; you can query the underlying data or the embedding content, and you can define how the fields in the model get stored as embeddings in the first place.
I'm curious why you consider that an upside. Hypothetically speaking, wouldn't it be better if the embeddings could automatically be updated when you want them to be? Is the problem that it's not easy to automate based on the specific rules of when you want updates to happen?
We could add support for something like `pg_vectorize` in order to generate embeddings directly from the database. We simply haven't seen enough demand yet. Perhaps we haven't listened hard enough :')
We managed 200M long- and short-form embeddings (patents), indexed in ScaNN at runtime, with a metadata layer on LevelDB. Some simple murmur hash sharding and a stable K8s cluster on GCP was all we needed. Low-millisecond retrieval and rerank augmenting a primary search.
I think in zero cases would we go back and use vector DBs or managed services if they were available to us (including Lucene or relational DB add-ons).
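For the curious, the sharding half of that setup fits in a few lines. In this sketch hashlib stands in for murmur, and the shard count is made up:

```python
import hashlib

def shard_for(doc_id, num_shards):
    # Stable hash routing: the same id always lands on the same shard,
    # so no central lookup table is needed
    digest = hashlib.md5(doc_id.encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_shards

shard = shard_for("patent-US1234567", num_shards=16)
assert shard == shard_for("patent-US1234567", num_shards=16)  # deterministic
print(shard)
```

The usual caveat applies: changing `num_shards` remaps almost every id, which is why consistent hashing exists for clusters that resize.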
This reads solely as a sales pitch, which quickly cuts to the "we're selling this product so you don't have to think about it."
...when you actually do want to think about it (in 2024).
Right now, we're collectively still figuring out:
1. Best chunking strategies for documents
2. Best ways to add context around chunks of documents
3. How to mix and match similarity search with hybrid search
4. Best way to version and update your embeddings
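Point 1 is the one with the clearest baseline today: fixed-size chunking with overlap, which everything fancier (semantic, recursive, document-aware) gets measured against. A minimal sketch; the sizes are arbitrary:

```python
def chunk(text, size=200, overlap=40):
    # Fixed-size chunks; each one re-reads the last `overlap` chars of the
    # previous chunk so sentences straddling a boundary survive in one piece
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

doc = "x" * 500
print([len(p) for p in chunk(doc)])  # [200, 200, 180]
```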
We agree a lot of stuff still needs to be figured out, which is why we made the vectorizer very configurable. You can configure chunking strategies and formatting (which is a way to add context back into chunks), and you can mix semantic and lexical search on the results. That handles your 1, 2, and 3. Versioning can mean a different version of the data (in which case the versioning info lives with the source data) OR a different embedding config, which we also support [1].
Admittedly, right now we have predefined chunking strategies. But we plan to add custom-code options very soon.
Our broader point is that the things you highlight above are the right things to worry about, not the data workflow ops and babysitting your lambda jobs. That's what we want to handle for you.
Or you can use Cassandra / DataStax Astra to store the original text, the metadata and the embedding in a single table and then do hybrid queries against them (with pre- or post-filtering, optimized automatically).
whakim | 1 year ago
avthar | 1 year ago
[1]: https://github.com/timescale/pgvectorscale
[2]: https://www.timescale.com/blog/how-we-made-postgresql-as-fas...
morgango | 1 year ago
(Disclaimer: I work for Elastic)
https://www.elastic.co/search-labs/blog/semantic-search-simp...
pjot | 1 year ago
https://github.com/patricktrainer/duckdb-embedding-search
avthar | 1 year ago
[1]: https://github.com/timescale/pgai
edanm | 1 year ago
hhdhdbdb | 1 year ago
metalwhale | 1 year ago
mind-blight | 1 year ago
TechDebtDevin | 1 year ago
_bramses | 1 year ago
michaelmior | 1 year ago
avthar | 1 year ago
therealdrag0 | 1 year ago
jdthedisciple | 1 year ago
avthar | 1 year ago
dmezzetti | 1 year ago
dinobones | 1 year ago
Storing documents makes much more sense.
choilive | 1 year ago
spmurrayzzz | 1 year ago
kohlerm | 1 year ago
hmottestad | 1 year ago
cryptica | 1 year ago
kgeist | 1 year ago
petesergeant | 1 year ago
osigurdson | 1 year ago
markusw | 1 year ago
xyc | 1 year ago
bryantwolf | 1 year ago
Here's a bit more of a complicated example of what I'm talking about: https://blog.bawolf.com/p/embeddings-are-a-good-starting-poi...
cevian | 1 year ago
[1]: https://www.timescale.com/blog/how-we-made-postgresql-as-fas...
jeffchuber | 1 year ago
gillesjacobs | 1 year ago
keithwhor | 1 year ago
[0] https://github.com/instant-dev/orm?tab=readme-ov-file#using-...
[1] https://github.com/instant-dev/orm?tab=readme-ov-file#using-...
mkaszkowiak | 1 year ago
Personally, I can vouch for ParadeDB: https://www.paradedb.com/
It adds extra extensions to PostgreSQL which enable vector indexing, full-text search, and BM25. Works great, and the developers are helpful!
The major difference is that you must generate the embeddings by yourself, but I consider it an upside - to each their own :)
michaelmior | 1 year ago
philippemnoel | 1 year ago
ok123456 | 1 year ago
unholyguy001 | 1 year ago
cevian | 1 year ago
ramoz | 1 year ago
mattxxx | 1 year ago
cevian | 1 year ago
[1]: https://www.timescale.com/blog/which-rag-chunking-and-format...
torsstei | 1 year ago
pkolaczk | 1 year ago