mathis-l's comments

mathis-l | 1 month ago | on: Git Rebase for the Terrified

I’ve worked on a code base that was about 15 years old and had gone through many team changes. It was a tricky domain with lots of complicated business logic. When I was making changes to the code, the commit history was often the only way to figure out whether certain behavior was intended and why it was implemented that way. Documentation about how the product should behave often lacked that level of detail. I was always thankful when a dev who was long gone from the team had written commit messages that communicated the intent behind a change.

mathis-l | 2 years ago | on: Introduction to vector similarity search (2022)

You might want to give Haystack a try (disclaimer: I work at deepset, the company behind Haystack).

Haystack allows you to pre-process your documents into smaller chunks, calculate embeddings and index them into a document store. You can wrap all of that in a modular pipeline if you want.
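To make the indexing step concrete, here is a plain-Python sketch of the idea (not Haystack's actual API — the `chunk` and `embed` helpers are toy stand-ins for a real pre-processor and embedding model, and the list is a stand-in for a document store):

```python
def chunk(text, size=5, overlap=2):
    """Split text into word chunks of `size` words, sharing `overlap` words."""
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size])
            for i in range(0, max(len(words) - overlap, 1), step)]

def embed(text):
    """Toy embedding: a character-frequency vector.
    A real pipeline would call an embedding model here."""
    vec = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1.0
    return vec

# Stand-in for an in-memory document store: each entry holds a chunk
# of the original document plus its pre-computed embedding.
store = []
for doc in ["git rebase rewrites commit history",
            "vector search finds similar embeddings"]:
    for c in chunk(doc):
        store.append({"content": c, "embedding": embed(c)})
```

The point is the shape of the pipeline — split, embed, index — not the toy functions; in Haystack those pieces are swappable components.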

Next, you can query your documents using a retrieval pipeline.
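Querying then boils down to embedding the query and ranking stored chunks by vector similarity. A minimal sketch with cosine similarity (the tiny store and its 2-d embeddings are made-up example values, not real model output):

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query_vec, store, top_k=3):
    """Return the top_k documents whose embeddings are closest to the query."""
    ranked = sorted(store, key=lambda d: cosine(query_vec, d["embedding"]),
                    reverse=True)
    return ranked[:top_k]

# Toy store with hypothetical pre-computed 2-d embeddings:
store = [
    {"content": "doc A", "embedding": [1.0, 0.0]},
    {"content": "doc B", "embedding": [0.7, 0.7]},
    {"content": "doc C", "embedding": [0.0, 1.0]},
]
top = retrieve([1.0, 0.1], store, top_k=2)  # nearest chunks first
```

A real document store does the same ranking with an approximate nearest-neighbor index instead of a full scan.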

Regarding document store selection: Replacing your document store is easy, so I would start with the simplest one, probably an InMemoryDocumentStore. When you want to move from experimentation to production, you’ll want to tailor your selection to your use case. Here are a few things I’ve observed.

You don’t want to manage anything and are fine with SaaS -> Pinecone

You have a very large dataset (500M+ vectors) and you want something that you can run locally -> maybe Qdrant

You have metadata that you want to incorporate into your retrieval or you want to do hybrid search -> OpenSearch/Elasticsearch

Regarding model selection:

We’ve seen https://huggingface.co/sentence-transformers/multi-qa-distil... work well as a solid semantic search baseline with fast indexing times. If the performance is lacking, you could look at the E5 models. What also works fairly well for us is a multi-step retrieval process where we retrieve ~100 documents with BM25 first and then use a cross-encoder to rank them by semantic relevance. Very fast indexing times are a benefit, and you don’t need a beefy vector DB to store your documents. Latency at query time will be slightly higher, though, and you might need a GPU machine to run your query pipeline.
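A sketch of that two-stage setup in plain Python. Both scoring functions are cheap stand-ins: `bm25_like` is keyword counting rather than real BM25, and `cross_encoder_score` uses Jaccard overlap where a real cross-encoder would jointly encode the (query, document) pair with a model:

```python
def bm25_like(query, doc):
    """Stage-1 score: raw keyword-match count (stand-in for real BM25)."""
    terms, words = set(query.lower().split()), doc.lower().split()
    return sum(words.count(t) for t in terms)

def cross_encoder_score(query, doc):
    """Stage-2 score: Jaccard overlap as a cheap proxy for a cross-encoder,
    which would score the (query, doc) pair with a neural model."""
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / len(q | d) if q | d else 0.0

def two_stage_retrieve(query, docs, first_k=100, top_k=3):
    # Stage 1: cheap lexical retrieval narrows the candidate set.
    candidates = sorted(docs, key=lambda d: bm25_like(query, d),
                        reverse=True)[:first_k]
    # Stage 2: the expensive reranker only ever sees first_k candidates,
    # which is what keeps query latency manageable.
    return sorted(candidates, key=lambda d: cross_encoder_score(query, d),
                  reverse=True)[:top_k]
```

The structure is the point: the heavy model runs on ~100 candidates instead of the whole corpus, which is why this works without a vector DB at all.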

Retrieval in Haystack: https://docs.haystack.deepset.ai/docs/retriever

Cross-Encoder approach: https://docs.haystack.deepset.ai/docs/ranker

Blog Post with an end-to-end retrieval example: https://haystack.deepset.ai/blog/how-to-build-a-semantic-sea...
