
A gentle introduction to vector databases

133 points | fzliu | 4 years ago | frankzliu.com

33 comments

[+] joexner|4 years ago|reply
Vector indices are the novel part of vector databases. Let's hear more about them. The rest is just BLOB CRUD.
[+] dontreact|4 years ago|reply
The way vector indices typically work can make CRUD on them a real challenge. There is definitely novelty in being able to do both ANN indexing and fast, high-throughput CRUD.

In addition, the R of CRUD is hard to combine with vector indices. Case in point: I am still waiting for Elasticsearch to support ANN and regular structured filtering together well.
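To illustrate why filtering and nearest-neighbor search are awkward to combine, here is a minimal sketch. It uses brute-force search in place of a real ANN index, and the item layout (`id`, `vec`, `lang`) and both function names are hypothetical, not taken from Elasticsearch or any other product. Filtering after the top-k lookup can return fewer than k hits, while filtering first needs a search that can be restricted to an arbitrary subset.

```python
import math

def l2(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def post_filter_search(items, query, k, predicate):
    """Take the k nearest vectors first, then drop those failing the filter."""
    nearest = sorted(items, key=lambda it: l2(it["vec"], query))[:k]
    return [it["id"] for it in nearest if predicate(it)]

def pre_filter_search(items, query, k, predicate):
    """Apply the filter first, then take the k nearest among the survivors."""
    candidates = [it for it in items if predicate(it)]
    nearest = sorted(candidates, key=lambda it: l2(it["vec"], query))[:k]
    return [it["id"] for it in nearest]

# Hypothetical toy data: two English items near the query, two German ones farther away.
items = [
    {"id": 0, "vec": (0.0, 0.0), "lang": "en"},
    {"id": 1, "vec": (0.1, 0.0), "lang": "en"},
    {"id": 2, "vec": (0.2, 0.0), "lang": "de"},
    {"id": 3, "vec": (5.0, 5.0), "lang": "de"},
]
query = (0.0, 0.0)

def is_german(item):
    return item["lang"] == "de"

post = post_filter_search(items, query, 2, is_german)  # both nearest items are "en"
pre = pre_filter_search(items, query, 2, is_german)

print("post-filter:", post)
print("pre-filter:", pre)
```

With this toy data the post-filter variant comes back empty, because the two nearest vectors both fail the filter. The pre-filter variant finds both matching items, but only because the brute-force scan here can search any subset; a prebuilt ANN index generally cannot do that for free.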

[+] mrintellectual|4 years ago|reply
Thanks for your feedback. I'm writing a post on vector indices and will throw it up this week.
[+] mrintellectual|4 years ago|reply
As mentioned in the article, I recommend Milvus (https://milvus.io) - it's open source and cloud-native, with standalone versions available. Alternatively, if you're looking for an open-source solution for generating embeddings, I recommend Towhee (https://github.com/towhee-io/towhee).
[+] dtjohnnyb|4 years ago|reply
One downside of Milvus is that version 1 doesn't do filtering (necessary for most search applications) and version 2 is significantly slower. Google's vector nearest-neighbor offering, Weaviate, and Vespa are much better options if you expect to extend to more realistic workloads.
[+] cbsmith|4 years ago|reply
Everything old is new again. ;-)
[+] gk1|4 years ago|reply
This is a great writeup, and awesome to see vector databases come up more and more often.

For anyone interested in going down this rabbit hole, we have an entire learning center about vector databases and vector search (https://www.pinecone.io/learn/) including the obligatory "What is a Vector Database" intro with example notebooks: https://www.pinecone.io/learn/vector-database/

[+] dang|4 years ago|reply
You've posted several comments in this thread alone linking to your product, and it seems that the majority of your posts have been doing this for quite a while now. I'm sure it's excellent work, but can you please stop doing this?

It's fine to link to your own work occasionally, when it's particularly relevant, as part of a diverse mix of posts on unrelated things*. It's not ok to use HN primarily for promotion. See https://news.ycombinator.com/newsguidelines.html: "Please don't use HN primarily for promotion. It's ok to post your own stuff occasionally, but the primary use of the site should be for curiosity."

When people do that we eventually start penalizing their accounts and sites, or in egregious cases, banning them. You're a good HN user, but this is still excessive. You're crossing the line at which the community starts to think of the word 'spam', and we inevitably start getting emails about it.

* I do get that your work is particularly relevant in a thread like this. What's missing is the 'diverse mix of posts on unrelated things'. In such a context, posting repeatedly about your own stuff starts to come across the wrong way.

[+] starkd|4 years ago|reply
Thank you for this. One approach I find missing in your blog is that of distance-based indexing. It's an approach that indexes vectors according to distances from chosen vantage points from within the data set. I've done some preliminary work on creating a system for images: phash.dev
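For readers unfamiliar with the idea, distance-based indexing can be sketched roughly as follows. This is a simplified flat pivot table, not a vantage-point tree and not the system linked above; the `PivotIndex` class, its parameters, and the random choice of vantage points are all assumptions made for illustration. Distances from every point to a few vantage points are stored up front, and at query time the triangle inequality gives a lower bound that lets the search skip points without computing their full distance.

```python
import math
import random

def dist(a, b):
    """Euclidean distance; any metric obeying the triangle inequality would work."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

class PivotIndex:
    """Hypothetical flat index: precomputed distances to a few vantage points."""

    def __init__(self, points, num_pivots=3, seed=0):
        rng = random.Random(seed)
        self.points = list(points)
        # Assumption: vantage points are picked at random from the data set.
        self.pivots = rng.sample(self.points, min(num_pivots, len(self.points)))
        # table[i][j] holds the distance from point i to vantage point j.
        self.table = [[dist(p, v) for v in self.pivots] for p in self.points]

    def nearest(self, query):
        """Exact nearest neighbor; returns (index, distance, distances computed)."""
        q_to_pivots = [dist(query, v) for v in self.pivots]
        best_idx, best_dist, computed = -1, math.inf, 0
        for i, row in enumerate(self.table):
            # Triangle inequality: |d(q, v) - d(x, v)| <= d(q, x) for every pivot v.
            lower = max(abs(qd - pd) for qd, pd in zip(q_to_pivots, row))
            if lower >= best_dist:
                continue  # this point cannot beat the current best, so skip it
            computed += 1
            d = dist(query, self.points[i])
            if d < best_dist:
                best_idx, best_dist = i, d
        return best_idx, best_dist, computed

# Hypothetical data: 500 random 8-dimensional points and one random query.
rng = random.Random(42)
points = [tuple(rng.random() for _ in range(8)) for _ in range(500)]
query = tuple(rng.random() for _ in range(8))

index = PivotIndex(points)
idx, d, computed = index.nearest(query)
print(f"nearest index {idx} at distance {d:.4f}; "
      f"{computed} of {len(points)} full distances computed")
```

The result is the same as a full scan; the saving is in how many full distance computations are skipped, which depends heavily on the data and on how the vantage points are chosen.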
[+] liminal|4 years ago|reply
Pinecone looks great. Any plans to have a non-hosted option?