
aiappreciator | 2 years ago

It is true that every major DB vendor, SQL or not, is smashing the AI/vector keyword on their front pages. In Elastic, for example, vector capabilities have gone from laughable to respectable in a year. It's a lot simpler to just use one DB instead of many.

But a question for true DB experts here:

1. Is there any real advantage to building a dedicated vector DB from scratch?

2. Is a vector DB something that can just be 'tacked on' to a normal DB with no major performance penalties?

We know from history that data warehouses are genuinely different from databases, and cloud data warehouses are overwhelmingly superior to on-prem ones. So that emerged as a distinct, enduring category with Snowflake/Databricks/BigQuery.


jamesblonde|2 years ago

Data warehouses are columnar stores. They are very different from row-oriented databases like Postgres and MySQL. Operations on columns, e.g., aggregations (mean of a column), are very efficient.
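To make the row vs. column point concrete, here is a minimal in-memory sketch. The sample data and the function names (`mean_amount_row_store`, `mean_amount_column_store`) are made up for illustration; real engines work on disk pages and add compression, but the access pattern is the same idea.

```python
from statistics import mean

# Row-oriented layout: each record is stored together (hypothetical sample data).
rows = [
    {"id": 1, "region": "eu", "amount": 10.0},
    {"id": 2, "region": "us", "amount": 20.0},
    {"id": 3, "region": "eu", "amount": 30.0},
]

# Column-oriented layout: each column is stored contiguously.
columns = {
    "id": [1, 2, 3],
    "region": ["eu", "us", "eu"],
    "amount": [10.0, 20.0, 30.0],
}


def mean_amount_row_store(records):
    # Has to visit every record to pull out a single field from each.
    return mean(r["amount"] for r in records)


def mean_amount_column_store(cols):
    # Reads only the one column the aggregation needs.
    return mean(cols["amount"])


print(mean_amount_row_store(rows), mean_amount_column_store(columns))
```

Both give the same answer; the difference is how much unrelated data has to be scanned to get there.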

Most vector databases use one of a few different vector indexing libraries: FAISS, hnswlib, and ScaNN (Google only) are popular. The newer vector DBs, like Weaviate, have introduced their own indexes, but I haven't seen any performance difference.

Reference: https://ann-benchmarks.com/
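For context on what those libraries are approximating: the baseline is exact nearest-neighbor search, which scores the query against every stored vector. A rough pure-Python sketch follows; it does not use FAISS, hnswlib, or ScaNN, and the names `cosine_similarity` and `exact_top_k` plus the sample vectors are only illustrative.

```python
import math


def cosine_similarity(a, b):
    # Dot product divided by the product of the vector lengths.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def exact_top_k(query, vectors, k):
    # Brute force: score every vector, then keep the k best.
    # Cost grows linearly with the number of vectors, which is the
    # work ANN indexes trade a little recall to avoid.
    scored = [(cosine_similarity(query, v), i) for i, v in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [i for _, i in scored[:k]]


# Hypothetical 2-d "embeddings".
vectors = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
print(exact_top_k([1.0, 0.0], vectors, k=2))
```

An ANN index returns (mostly) the same ids without scanning everything, and benchmarks of this kind compare how much recall each index gives up for that speed.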

andris9|2 years ago

Elastic does a great job with it. My one-person company builds software that mirrors emails from IMAP accounts to Elasticsearch, and adding a vector search on top of that data to "chat" with emails was fairly simple. I was expecting an untold number of hurdles, but the only requirement was to have at least v8.8.0 of Elasticsearch (this was when they increased the supported vector sizes so that OpenAI embeddings would fit), and that's it. https://docs.emailengine.app/chat-with-emails-using-emaileng...

redwood|2 years ago

Single-node database systems that are not horizontally scalable and not built on a distributed-system foundation (e.g. Postgres) will certainly hit scaling bottlenecks if you just keep adding complexity to the workload. However, many modern database systems are built on a distributed-system foundation, with horizontal scaling and the ability to independently scale different constituent parts of the backend. These engines should have no problem.

charcircuit|2 years ago

The trade-off that you are interested in isn't about storing vectors, but rather about whether an index should be a part of the DBMS or external to it.

Some advantages of having a separate index are that it can work with different backends, it can be independently scaled, and it can index data for more than one database server.

Some disadvantages are increased latency, increased complexity, and distributed system problems.
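One of those distributed-system problems, sketched as a toy: an external index keeps its own copy of the embeddings, so it can return ids the database has already deleted until the two are re-synced. Everything here (`Database`, `ExternalIndex`, `sync_from`, the sample documents) is hypothetical and not modeled on any particular product.

```python
class Database:
    """Toy primary store: doc id -> (text, embedding)."""

    def __init__(self):
        self.rows = {}

    def upsert(self, doc_id, text, embedding):
        self.rows[doc_id] = (text, embedding)

    def delete(self, doc_id):
        self.rows.pop(doc_id, None)


class ExternalIndex:
    """Toy separate index that holds its own copy of the embeddings."""

    def __init__(self):
        self.embeddings = {}

    def sync_from(self, db):
        # Full rebuild; a real system would stream changes instead.
        self.embeddings = {doc_id: emb for doc_id, (_, emb) in db.rows.items()}

    def nearest(self, query):
        # Exact search by squared Euclidean distance over the copied vectors.
        def distance(doc_id):
            return sum((q - x) ** 2 for q, x in zip(query, self.embeddings[doc_id]))

        return min(self.embeddings, key=distance)


db = Database()
index = ExternalIndex()
db.upsert("a", "hello", [0.0, 0.0])
db.upsert("b", "world", [1.0, 1.0])
index.sync_from(db)

db.delete("a")                           # the database changes...
stale_hit = index.nearest([0.1, 0.1])    # ...but the index still answers "a"
is_dangling = stale_hit not in db.rows   # the hit points at a deleted row

index.sync_from(db)                      # only after a re-sync do they agree
fresh_hit = index.nearest([0.1, 0.1])
```

An index that lives inside the DBMS can be updated in the same transaction as the row, which is the consistency an external index has to rebuild by other means.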

hobs|2 years ago

After spending the last six months working with a vector database, I qualified Postgres with its vector extensions this morning, and I am trying to toss out everything else.

The operational pains if you need to self-host this stuff are real: split brain, backup/restore not really considered (compared to a normal database's features), and things like replication and sharding _exist_ but are often a buggy mess.

OLAP is definitely distinct from OLTP, and most of these vector queries have some aspect of both: they are similar to OLAP in that they need a decent amount of preprocessing to be useful (inference), and they are similar to OLTP in that they are often used for serving point queries or tiny lookups.