top | item 43485264

(no title)

alexgarcia-xyz | 11 months ago

I really dig more content about how vector databases/tools handle problems like this!

In sqlite-vec, there's only a flat brute-force index (though DiskANN/IVF will be coming soon). But we do have a concept of partition keys[0], which allows you to "internally shard" the vector index based on a user_id or any other value.

   create virtual table vec_documents using vec0(
      document_id integer primary key,
      user_id integer partition key,
      contents_embedding float[1024]
   )
Then at query time, any WHERE constraints on a partition key are pushed-down, so only matching vectors are searched instead of the entire index.

  select
    document_id,
    user_id,
    distance
  from vec_documents
  where contents_embedding match :query
    and k = 20
    and user_id = 123; -- only search documents from user 123
Definitely not as performant as a proper vector index, but a lot of real-world application have these natural groups anyway. Like "search only English documents" or "search entries in this workspace only", or even "search comments only from the past 30 days."

Internally sqlite-vec stores vectors in "chunks", so when partition keys are definds, every chunk is associated with a unique combination of all partition keys. Kinda hard to describe, but if you create a vec0 virtual table and insert some values, you can inspect the internal "shadow tables" in the SQLite database to see how it's all stored.

[0] https://alexgarcia.xyz/sqlite-vec/features/vec0.html#partiti...

discuss

order

ankitml|11 months ago

Well, I have a brute force strategy for pgvector working reasonably well. Individual, partial indexes. It works for all those queries with category_id=<> clauses. You only need an index for larger categories, for categories with rows below a threshold you dont need index a KNN/dot product would work.