mathis-l's comments
mathis-l | 1 month ago | on: Git Rebase for the Terrified
mathis-l | 8 months ago | on: Show HN: Luna Rail – Treating night trains as a spatial optimization problem
mathis-l | 10 months ago | on: Show HN: Chonky – a neural approach for text semantic chunking
It uses a similar approach, but the focus is on sentence/paragraph segmentation in general rather than specifically on RAG. It also has some benchmarks. It might be a good source of inspiration for where to take Chonky next.
mathis-l | 2 years ago | on: Ask HN: How do I train a custom LLM/ChatGPT on my own documents in Dec 2023?
Disclaimer: I work at deepset
mathis-l | 2 years ago | on: Introduction to vector similarity search (2022)
Haystack allows you to pre-process your documents into smaller chunks, calculate embeddings and index them into a document store. You can wrap all of that in a modular pipeline if you want.
Next, you can query your documents using a retrieval pipeline.
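To make the chunk → embed → index → query flow concrete, here is a toy pure-Python sketch. The `chunk`/`embed`/`InMemoryStore` names and the bag-of-words "embedding" are stand-ins invented for illustration, not Haystack's actual API; a real pipeline would use Haystack's pre-processor, a sentence-transformer model, and one of its document stores.

```python
import math
import re


def chunk(text, max_words=50):
    # Stand-in pre-processor: split a document into word-bounded chunks.
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def embed(text):
    # Toy bag-of-words "embedding" (normalized term counts).
    # A real pipeline would call a sentence-transformer here.
    counts = {}
    for token in re.findall(r"\w+", text.lower()):
        counts[token] = counts.get(token, 0) + 1
    norm = math.sqrt(sum(v * v for v in counts.values())) or 1.0
    return {t: v / norm for t, v in counts.items()}


def cosine(a, b):
    # Cosine similarity between two normalized sparse vectors.
    return sum(w * b.get(t, 0.0) for t, w in a.items())


class InMemoryStore:
    # Minimal in-memory document store: index chunks, query by similarity.
    def __init__(self):
        self.docs = []  # list of (chunk_text, embedding)

    def index(self, texts):
        for text in texts:
            for c in chunk(text):
                self.docs.append((c, embed(c)))

    def query(self, q, top_k=3):
        qv = embed(q)
        scored = sorted(((cosine(qv, dv), text) for text, dv in self.docs),
                        reverse=True)
        return [text for _, text in scored[:top_k]]


store = InMemoryStore()
store.index([
    "Haystack lets you build retrieval pipelines.",
    "Bananas are rich in potassium.",
])
print(store.query("retrieval pipeline", top_k=1))
# -> ['Haystack lets you build retrieval pipelines.']
```

Swapping the store here means swapping one class, which is the point made below: start simple, change the backend later.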
Regarding document store selection: Replacing your document store is easy, so I would start with the simplest one, probably an InMemoryDocumentStore. When you want to move from experimentation to production, you'll want to tailor your selection to your use case. Here are a few things I've observed.
You don’t want to manage anything and are fine with SaaS -> Pinecone
You have a very large dataset (500M+ vectors) and you want something that you can run locally -> maybe Qdrant
You have metadata that you want to incorporate into your retrieval, or you want to do hybrid search -> OpenSearch/Elasticsearch
Regarding model selection:
We've seen https://huggingface.co/sentence-transformers/multi-qa-distil... work well as a fast-indexing baseline for semantic search. If you feel the performance is lacking, you could look at the E5 models. What also works fairly well for us is a multi-step retrieval process: first retrieve ~100 documents with BM25, then use a cross-encoder to rank those by semantic relevance. This gives you very fast indexing times, and you don't need a beefy vector DB to store your documents. Latency at query time will be slightly higher, though, and you might need a GPU machine to run your query pipeline.
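A pure-Python sketch of that two-stage idea, with a simplified BM25 as the first stage and a stand-in scoring function where the cross-encoder would go (in a real pipeline, stage two would feed (query, document) pairs through a cross-encoder model instead of counting term overlap):

```python
import math
import re


def tokens(text):
    return re.findall(r"\w+", text.lower())


def bm25_scores(query, docs, k1=1.5, b=0.75):
    # Simplified BM25: score every document against the query terms.
    n = len(docs)
    doc_tokens = [tokens(d) for d in docs]
    avg_len = sum(len(t) for t in doc_tokens) / n
    scores = []
    for toks in doc_tokens:
        score = 0.0
        for term in set(tokens(query)):
            tf = toks.count(term)
            df = sum(1 for t in doc_tokens if term in t)
            idf = math.log((n - df + 0.5) / (df + 0.5) + 1.0)
            score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(toks) / avg_len))
        scores.append(score)
    return scores


def rerank(query, candidates):
    # Stand-in for the cross-encoder: fraction of query terms present in the doc.
    # A real pipeline would run a cross-encoder over each (query, doc) pair.
    q = set(tokens(query))
    return sorted(candidates,
                  key=lambda d: len(q & set(tokens(d))) / len(q),
                  reverse=True)


def search(query, docs, first_stage_k=100, top_k=3):
    # Stage 1: cheap lexical retrieval narrows the corpus to ~100 candidates.
    scored = sorted(zip(bm25_scores(query, docs), docs), reverse=True)
    candidates = [d for _, d in scored[:first_stage_k]]
    # Stage 2: the expensive reranker only ever sees the candidates.
    return rerank(query, candidates)[:top_k]


docs = [
    "BM25 is a ranking function used by search engines.",
    "Cross-encoders jointly score a query and a document.",
    "Bananas are rich in potassium.",
]
print(search("ranking function for search", docs, top_k=1))
# -> ['BM25 is a ranking function used by search engines.']
```

The design point is that the reranker's cost scales with `first_stage_k`, not with corpus size, which is why indexing stays cheap while only query latency goes up.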
Retrieval in Haystack: https://docs.haystack.deepset.ai/docs/retriever
Cross-Encoder approach: https://docs.haystack.deepset.ai/docs/ranker
Blog Post with an end-to-end retrieval example: https://haystack.deepset.ai/blog/how-to-build-a-semantic-sea...