top | item 39429606

(no title)

rustyboy | 2 years ago

Curious about how people are using vectordbs at an enterprise level.

Let's say you have a team of 5 data scientists/developers who are working on a collection of GenAI features/tooling. Does it make sense to have one single vectordb where all documentation is embedded and powers all the apps, or do you make a bunch of niche databases that are tailored to the service?

Also, one of the things I've noticed is that these databases seem less optimized for update operations, so when user #1 embeds and saves 100 documents and then user #2 does the same with 10 overlapping, I'd guess that the duplicated entries in the similarity space would crowd new documents out of the results. How are people handling that?
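To make the overlap scenario concrete, here is a minimal sketch of one way it could be handled, assuming the database supports upserts keyed by an ID. The idea is to derive the ID from a hash of the document text, so a second upload of the same document overwrites the first instead of adding a duplicate. The dict `store` stands in for a real vector DB, and `fake_embed` is a hypothetical placeholder for an embedding model; neither is any particular product's API.

```python
import hashlib
import uuid


def doc_id(text: str) -> str:
    """Deterministic ID: the same normalized text always maps to the same UUID."""
    normalized = " ".join(text.split()).lower()
    digest = hashlib.sha256(normalized.encode("utf-8")).digest()
    return str(uuid.UUID(bytes=digest[:16]))


def fake_embed(text: str) -> list:
    # Hypothetical stand-in for a real embedding model.
    return [float(len(text)), float(sum(map(ord, text)) % 97)]


def upsert(store: dict, docs: list) -> int:
    """Insert or overwrite by deterministic ID; return how many docs were new."""
    added = 0
    for text in docs:
        key = doc_id(text)
        if key not in store:
            added += 1
        store[key] = {"vector": fake_embed(text), "text": text}
    return added


store = {}
user1 = [f"document {i}" for i in range(100)]
user2 = [f"document {i}" for i in range(90, 190)]  # 10 overlap with user1

first = upsert(store, user1)
second = upsert(store, user2)
print(first, second, len(store))
```

This only catches exact duplicates after whitespace and case normalization. Near-duplicates (an edited version of the same document) would need something else, such as a similarity threshold check before inserting.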

falling_myshkin|2 years ago

I am not sure what you mean specifically by 'overlapping'. But high-dimensional vector space is really "big", and pairwise distances concentrate: every point ends up roughly equally far from every other, so the nearest and farthest neighbors are much harder to tell apart than in low dimensions (this is the curse of dimensionality for the Euclidean norm). This is already something one has to think about regardless of the similarity of the source documents. From reading Wikipedia, it seems it has been argued that the curse is worst with independent and identically distributed features.
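A quick way to see what the curse does to distances, as a rough sketch: sample random points uniformly in the unit cube (an assumption chosen for simplicity, not a model of real embeddings) and compare the relative gap between the farthest and nearest neighbor of a query. The function name `relative_contrast` and the point counts are illustrative choices.

```python
import math
import random


def relative_contrast(dim: int, n_points: int = 500, seed: int = 0) -> float:
    """(farthest - nearest) / nearest distance from one random query point."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        dists.append(math.dist(query, point))  # Euclidean distance
    return (max(dists) - min(dists)) / min(dists)


contrast_low = relative_contrast(2)
contrast_high = relative_contrast(1000)
print(f"dim=2: {contrast_low:.2f}, dim=1000: {contrast_high:.2f}")
```

In low dimensions the nearest neighbor is far closer than the farthest one, so the contrast is large. In high dimensions the contrast shrinks toward zero, which is why "nearest" carries less information. Real embeddings are not uniform or independent across dimensions, so the effect is usually milder in practice.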

swalsh|2 years ago

We have multiple teams kind of working in silos, so we haven't really consolidated on a single enterprise solution yet. That said, the team I'm on has consolidated on using qdrant with different collections. We've also started using a sort of Hungarian notation in collection names, as we've just run into the problem of multiple embedding models.
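A hypothetical version of such a naming scheme (not necessarily the one used here) could encode the embedding model and vector dimension in the collection name, then refuse queries embedded with a different model. The `__` separator, the `d<dim>` suffix, and all function names are assumptions made up for this sketch; the model name is only an example.

```python
import re

SEP = "__"  # assumed separator; slugs below never contain underscores


def slug(value: str) -> str:
    """Lowercase and replace runs of non-alphanumerics with a single hyphen."""
    return re.sub(r"[^a-z0-9]+", "-", value.lower()).strip("-")


def collection_name(corpus: str, model: str, dim: int) -> str:
    """Build a name like <corpus>__<model>__d<dim>."""
    return SEP.join([slug(corpus), slug(model), f"d{dim}"])


def parse_collection_name(name: str) -> dict:
    corpus, model, dim = name.split(SEP)
    return {"corpus": corpus, "model": model, "dim": int(dim[1:])}


def assert_compatible(name: str, query_model: str, query_dim: int) -> None:
    """Raise if the query was embedded with a different model or dimension."""
    meta = parse_collection_name(name)
    if meta["model"] != slug(query_model) or meta["dim"] != query_dim:
        raise ValueError(
            f"collection {name!r} expects {meta['model']} (d{meta['dim']}), "
            f"got {slug(query_model)} (d{query_dim})"
        )


name = collection_name("HR Docs", "all-MiniLM-L6-v2", 384)
print(name)
assert_compatible(name, "all-MiniLM-L6-v2", 384)  # passes silently
```

Putting the model in the name makes a mismatch visible at a glance and cheap to check in code, at the cost of needing a new collection (and a re-embed) whenever the model changes.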

ofermend|2 years ago

Curious to hear what criteria each team considers important (top-3) for choosing the VDB they chose. There are so many vector databases available and in my experience it's actually not the most critical component in the overall GenAI/RAG architecture, although it gets the most attention.