codingjaguar | 1 year ago
1. The real problem in embedding data lifecycle management is changing the embedding model, which requires a migration. You can't solve that by simply streamlining vectorization and suddenly using a new model for newly ingested data. You need the non-fancy migration process: create a new collection, batch-generate new vectors with the new model, port them all over, dual-write all newly ingested documents in the meantime, and switch search traffic to the new collection once batch ingestion is done. Streamlining vectorization as part of the ingestion call doesn't solve that. It is still an interesting feature for lowering mental complexity, which is why at Zilliz (a vector db startup) our product https://zilliz.com/zilliz-cloud-pipelines supports it, and open-source Milvus plans to support calling out to an embedding service in its 3.0 release: https://milvus.io/docs/roadmap.md. That said, I must state that changing the embedding model is harder than the article makes it seem. We provide tools like bulk import to batch-port a whole dataset of vector embeddings along with other metadata like the original text or image URLs. But solving the problem with one "magic box" sounds unrealistic to me, at least for production use cases.
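The migration flow above (batch re-embed, dual-write, cut over) can be sketched in a few lines. Everything here is illustrative pseudocode with in-memory stand-ins, not a real vector DB API; `old_model`, `new_model`, and `VectorStore` are hypothetical names.

```python
class VectorStore:
    """Stand-in for a vector DB collection (hypothetical, not a real client)."""
    def __init__(self):
        self.rows = {}  # doc_id -> (text, vector)

    def upsert(self, doc_id, text, vector):
        self.rows[doc_id] = (text, vector)

def old_model(text):  # placeholder embedding models with different dimensions
    return [len(text)]

def new_model(text):
    return [len(text), len(text) * 2]

old, new = VectorStore(), VectorStore()
for i, text in enumerate(["doc a", "doc bb"]):
    old.upsert(i, text, old_model(text))  # pre-existing data, old embeddings

# Phase 1: batch re-embed everything already in the old collection.
for doc_id, (text, _) in old.rows.items():
    new.upsert(doc_id, text, new_model(text))

# Phase 2: dual-write every document ingested during the migration window.
def ingest(doc_id, text):
    old.upsert(doc_id, text, old_model(text))
    new.upsert(doc_id, text, new_model(text))

ingest(2, "doc ccc")

# Phase 3: once the batch is done, point search traffic at `new` and retire `old`.
serving = new
```

The point of the sketch is phase 2: without the dual-write window, anything ingested during the batch job would exist only in the old collection and be lost at cutover.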
2. The article links to an implementation that does naive doc processing like fixed chunking, but in reality people need more flexibility in parsing, doc chunking, and the choice of embedding models. That's why people reach for tools like LlamaIndex and unstructured.io and write a doc processing pipeline around them.
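For concreteness, a minimal pipeline of the shape described in #2 might look like the sketch below. All three stages are illustrative placeholders for the parts you'd customize (a real parser per file type, a sentence-aware chunker, a real embedding model); none of this is a LlamaIndex or unstructured.io API.

```python
def parse(raw: str) -> str:
    # Real pipelines dispatch on file type (PDF, HTML, ...); here we just strip.
    return raw.strip()

def chunk(text: str, size: int = 20, overlap: int = 5) -> list[str]:
    # Naive sliding-window chunker; production chunkers respect sentences/sections.
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

def embed(chunks: list[str]) -> list[list[float]]:
    # Placeholder embedding "model": swap in any real model here.
    return [[float(len(c))] for c in chunks]

def pipeline(raw: str) -> list[tuple[str, list[float]]]:
    chunks = chunk(parse(raw))
    return list(zip(chunks, embed(chunks)))
```

The argument in #2 is exactly that each of these three functions is a point of flexibility a "magic box" ingestion call would take away from you.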
3. Most vector DBs support storing the original unstructured data alongside the vector embedding. In Milvus, for example, users usually ingest the text, the vector of the text, and other labels like author, title, chunk id, and publish_time. Ingestion of that data is naturally atomic, as it's a single row of data. "Having data and embeddings out of sync" is just a false claim. When you update a document, you remove the old rows and add new rows bundling the new text with the new vector; I'm not sure how that could go out of sync. The real problem is #1, the migration problem when you want to change the embedding model: you have to wipe out all existing data's vectors, as they are incompatible with the new embedding model, so you can't blend some docs with old embeddings and some with new. You need to migrate the whole dataset to a new collection and decide when to start serving queries from it.
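The single-row model described in #3 is easy to demonstrate. This is a hedged sketch with a plain Python list standing in for a collection; the schema fields mirror the example above (text, vector, author, chunk id), but the helpers are illustrative, not Milvus's actual API.

```python
collection = []  # each element is one atomic row: text + vector + labels together

def insert_doc(doc_id, chunks, embed, author):
    for chunk_id, text in enumerate(chunks):
        collection.append({
            "doc_id": doc_id, "chunk_id": chunk_id,
            "text": text, "vector": embed(text), "author": author,
        })

def update_doc(doc_id, chunks, embed, author):
    # Delete-then-insert: the text and its vector only ever exist in the
    # same row, so they cannot drift out of sync.
    global collection
    collection = [r for r in collection if r["doc_id"] != doc_id]
    insert_doc(doc_id, chunks, embed, author)

embed = lambda t: [float(len(t))]  # placeholder embedding model
insert_doc("d1", ["hello", "world!"], embed, "alice")
update_doc("d1", ["hi"], embed, "alice")
```

After the update, only the new rows remain, each carrying its own freshly computed vector.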
4. Lastly, the consistency/freshness problem in search usually lives between the source data, say files on S3 or a Zendesk table, and the serving stack, say the vector db. To build production-ready search, you need a sophisticated syncing mechanism that detects data changes at the source (S3, business apps, or even the public web), pushes them through the indexing pipeline to process the updates, and writes them to the serving stack. Tools like https://www.fivetran.com/blog/unlock-ai-powered-search-with-... can help avoid the engineering complexity of implementing that in house.
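At its simplest, the change-detection half of that sync mechanism is a diff between the source and what was last indexed. The sketch below polls a dict standing in for an S3 bucket and compares content hashes; real systems would use object ETags, CDC streams, or a vendor connector instead, and all names here are illustrative.

```python
import hashlib

seen = {}  # object key -> content hash recorded on the previous sync pass

def sync(source, index_update, index_delete):
    """One polling pass: push only the deltas into the indexing pipeline."""
    current = {k: hashlib.sha256(v.encode()).hexdigest()
               for k, v in source.items()}
    for key, digest in current.items():
        if seen.get(key) != digest:
            index_update(key, source[key])   # new or changed at the source
    for key in set(seen) - set(current):
        index_delete(key)                    # deleted at the source
    seen.clear()
    seen.update(current)
```

Deletions are the part naive pipelines miss: without the second loop, a document removed at the source keeps serving stale search results forever.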