mathis-l's comments
mathis-l | 1 month ago | on: Git Rebase for the Terrified
mathis-l | 8 months ago | on: Show HN: Luna Rail – Treating night trains as a spatial optimization problem
mathis-l | 10 months ago | on: Show HN: Chonky – a neural approach for text semantic chunking
It uses a similar approach, but the focus is on sentence/paragraph segmentation in general rather than specifically on RAG. It also has some benchmarks. It might be a good source of inspiration for where to take Chonky next.
mathis-l | 2 years ago | on: Ask HN: How do I train a custom LLM/ChatGPT on my own documents in Dec 2023?
Disclaimer: I work at deepset
mathis-l | 2 years ago | on: Introduction to vector similarity search (2022)
Haystack allows you to pre-process your documents into smaller chunks, calculate embeddings and index them into a document store. You can wrap all of that in a modular pipeline if you want.
Next, you can query your documents using a retrieval pipeline.
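To make the chunk → embed → index → query flow concrete, here is a toy pure-Python sketch. The `chunk`/`embed`/`InMemoryStore` names and the bag-of-words "embedding" are stand-ins invented for illustration, not Haystack's actual API; a real pipeline would use Haystack's pre-processor, a sentence-transformer model, and one of its document stores.

```python
import math
import re


def chunk(text, max_words=50):
    # Stand-in pre-processor: split a document into word-bounded chunks.
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def embed(text):
    # Toy bag-of-words "embedding" (normalized term counts).
    # A real pipeline would call a sentence-transformer here.
    counts = {}
    for token in re.findall(r"\w+", text.lower()):
        counts[token] = counts.get(token, 0) + 1
    norm = math.sqrt(sum(v * v for v in counts.values())) or 1.0
    return {t: v / norm for t, v in counts.items()}


def cosine(a, b):
    # Cosine similarity between two normalized sparse vectors.
    return sum(w * b.get(t, 0.0) for t, w in a.items())


class InMemoryStore:
    # Minimal in-memory document store: index chunks, query by similarity.
    def __init__(self):
        self.docs = []  # list of (chunk_text, embedding)

    def index(self, texts):
        for text in texts:
            for c in chunk(text):
                self.docs.append((c, embed(c)))

    def query(self, q, top_k=3):
        qv = embed(q)
        scored = sorted(((cosine(qv, dv), text) for text, dv in self.docs),
                        reverse=True)
        return [text for _, text in scored[:top_k]]


store = InMemoryStore()
store.index([
    "Haystack lets you build retrieval pipelines.",
    "Bananas are rich in potassium.",
])
print(store.query("retrieval pipeline", top_k=1))
# -> ['Haystack lets you build retrieval pipelines.']
```

Swapping the store here means swapping one class, which is the point made below: start simple, change the backend later.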
Regarding document store selection: Replacing your document store is easy, so I would start with the simplest one, probably an InMemoryDocumentStore. When you want to move from experimentation to production, you'll want to tailor your selection to your use case. Here are a few things I've observed.
You don’t want to manage anything and are fine with SaaS -> Pinecone
You have a very large dataset (500M+ vectors) and you want something that you can run locally -> maybe Qdrant
You have metadata that you want to incorporate into your retrieval, or you want to do hybrid search -> OpenSearch/Elasticsearch
Regarding model selection:
We've seen https://huggingface.co/sentence-transformers/multi-qa-distil... work well as a fast-indexing baseline for semantic search. If you feel the performance is lacking, you could look at the E5 models. What also works fairly well for us is a multi-step retrieval process: first retrieve ~100 documents with BM25, then use a cross-encoder to rank those by semantic relevance. This gives you very fast indexing times, and you don't need a beefy vector DB to store your documents. Latency at query time will be slightly higher, though, and you might need a GPU machine to run your query pipeline.
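A pure-Python sketch of that two-stage idea, with a simplified BM25 as the first stage and a stand-in scoring function where the cross-encoder would go (in a real pipeline, stage two would feed (query, document) pairs through a cross-encoder model instead of counting term overlap):

```python
import math
import re


def tokens(text):
    return re.findall(r"\w+", text.lower())


def bm25_scores(query, docs, k1=1.5, b=0.75):
    # Simplified BM25: score every document against the query terms.
    n = len(docs)
    doc_tokens = [tokens(d) for d in docs]
    avg_len = sum(len(t) for t in doc_tokens) / n
    scores = []
    for toks in doc_tokens:
        score = 0.0
        for term in set(tokens(query)):
            tf = toks.count(term)
            df = sum(1 for t in doc_tokens if term in t)
            idf = math.log((n - df + 0.5) / (df + 0.5) + 1.0)
            score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(toks) / avg_len))
        scores.append(score)
    return scores


def rerank(query, candidates):
    # Stand-in for the cross-encoder: fraction of query terms present in the doc.
    # A real pipeline would run a cross-encoder over each (query, doc) pair.
    q = set(tokens(query))
    return sorted(candidates,
                  key=lambda d: len(q & set(tokens(d))) / len(q),
                  reverse=True)


def search(query, docs, first_stage_k=100, top_k=3):
    # Stage 1: cheap lexical retrieval narrows the corpus to ~100 candidates.
    scored = sorted(zip(bm25_scores(query, docs), docs), reverse=True)
    candidates = [d for _, d in scored[:first_stage_k]]
    # Stage 2: the expensive reranker only ever sees the candidates.
    return rerank(query, candidates)[:top_k]


docs = [
    "BM25 is a ranking function used by search engines.",
    "Cross-encoders jointly score a query and a document.",
    "Bananas are rich in potassium.",
]
print(search("ranking function for search", docs, top_k=1))
# -> ['BM25 is a ranking function used by search engines.']
```

The design point is that the reranker's cost scales with `first_stage_k`, not with corpus size, which is why indexing stays cheap while only query latency goes up.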
Retrieval in Haystack: https://docs.haystack.deepset.ai/docs/retriever
Cross-Encoder approach: https://docs.haystack.deepset.ai/docs/ranker
Blog Post with an end-to-end retrieval example: https://haystack.deepset.ai/blog/how-to-build-a-semantic-sea...