
Show HN: Open-Source Colab Notebooks to Implement Advanced RAG Techniques

98 points | hbamoria | 1 year ago | github.com

Hey HN fam,

We’ve seen developers spend a lot of time implementing advanced RAG techniques from scratch.

While these techniques are essential for improving performance, their implementation requires a lot of effort and testing!

To help with this process, our team (Athina AI) has released Open-Source Advanced RAG Cookbooks.

This is a collection of ready-to-run Google Colab notebooks featuring the most commonly implemented techniques.

Please show us some love by starring the repo if you find this useful!

27 comments


Oras|1 year ago

One of the challenges I have with RAG is excluding table of contents, headers/footers and appendices from PDFs.

Is there a tool/technique to achieve this? I’m aware that I can use LLMs to do so, or read all pages and find identical text (header/footer), but I want to keep the page number as part of the metadata to ensure better citation on retrieval.

jonathan-adly|1 year ago

I would check out vision models as a technique to get around OCR errors.

ColPali is the standard implementation & SOTA. Much better than OCR. We maintain a ready-to-go retrieval API that implements this: https://github.com/tjmlabs/ColiVara

throwup238|1 year ago

You’ll need other heuristics for the ToC and indices, but headers/footers are easy to detect via n-gram deduplication. You’ll want some rolling logic to handle chapter changes, though.
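
The deduplication idea above can be sketched roughly as follows (simplified here to line-level rather than full n-gram matching; the function and parameter names are illustrative, not from any library). Lines that repeat across most pages are treated as headers/footers and dropped, while each page keeps its page number as metadata for citation:

```python
from collections import Counter

def strip_repeated_lines(pages, min_fraction=0.6):
    """Drop lines (headers/footers) that repeat across most pages.

    `pages` is a list of page-text strings; returns cleaned pages
    paired with their 1-based page numbers for citation metadata.
    """
    line_counts = Counter()
    for text in pages:
        # Count each distinct line at most once per page.
        for line in set(text.splitlines()):
            line_counts[line.strip()] += 1

    threshold = max(2, int(len(pages) * min_fraction))
    boilerplate = {line for line, n in line_counts.items() if n >= threshold}

    cleaned = []
    for page_num, text in enumerate(pages, start=1):
        kept = [l for l in text.splitlines() if l.strip() not in boilerplate]
        cleaned.append({"page": page_num, "text": "\n".join(kept)})
    return cleaned
```

This won't catch headers that embed a changing page number ("Page 7 of 90") — that's where the rolling/n-gram logic mentioned above comes in, e.g. comparing lines after masking digits.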

jonathan-adly|1 year ago

I would strongly advise against people learning based on LangChain.

It is abstraction hell, and will set you back thousands of engineer-hours the moment you want to do something differently.

RAG is actually a very simple thing to do; there's just too much VC money in the space & too many complexity merchants.

The best way to learn is outside of notebooks (the hard parts of RAG are all around the actual product), using as few frameworks as possible.

My preferred stack is FastAPI/numpy/redis. Simple as pie. You can swap Redis for pgvector/Postgres when you're ready for the next complexity step.
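
The numpy core of a stack like that can be sketched in a few lines — brute-force cosine-similarity retrieval over an in-memory embedding matrix (function names here are illustrative; in the described stack you'd persist the vectors in Redis and expose this behind a FastAPI endpoint):

```python
import numpy as np

def top_k(query_vec, doc_vecs, k=3):
    """Return indices of the k most similar documents by cosine similarity.

    doc_vecs: (n_docs, dim) array of document embeddings.
    query_vec: (dim,) array for the query embedding.
    """
    # Normalize rows so the dot product equals cosine similarity.
    doc_norms = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    q = query_vec / np.linalg.norm(query_vec)
    scores = doc_norms @ q
    # Highest-scoring documents first.
    return np.argsort(scores)[::-1][:k]
```

For corpora up to a few hundred thousand chunks, this brute-force scan is fast enough that an approximate-nearest-neighbor index is often premature.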

ellisv|1 year ago

I'd like to hear more about this – both your reasoning against LangChain and suggestions for alternatives.

My experience with LangChain has been a mixed bag. On the one hand it has been very easy to get up and running quickly. Following their examples actually works!

Trying to go beyond the examples to mix and match concepts was a real challenge because of the abstractions. As with any young framework in a fast-moving field, the concepts and abstractions change quickly, so the documentation shows multiple ways to do something but it isn't clear which is the "right" way.

jackmpcollins|1 year ago

I'd be really interested to hear what abstractions you would find useful for RAG. I'm building magentic which is focused on structured outputs and streaming, but also enables RAG [0], though currently has no specific abstractions for it.

[0] https://magentic.dev/examples/rag_github/

pchangr|1 year ago

Those were exactly my thoughts... however, I haven’t been able to find much material on how to implement this without relying on LangChain. Do you know of any beginner's material I could use to fill my gaps?

Jet_Xu|1 year ago

Interesting discussion! While RAG is powerful for document retrieval, applying it to code repositories presents unique challenges that go beyond traditional RAG implementations. I've been working on a universal repository knowledge graph system, and found that the real complexity lies in handling cross-language semantic understanding and maintaining relationship context across different repo structures (mono/poly).

Has anyone successfully implemented a language-agnostic approach that can:

1. Capture implicit code relationships without heavy LLM dependency?
2. Scale efficiently for large monorepos while preserving fine-grained semantic links?
3. Handle cross-module dependencies and version evolution?

Current solutions like AST-based analysis + traditional embeddings seem to miss crucial semantic contexts. Curious about others' experiences with hybrid approaches combining static analysis and lightweight ML models.

krawczstef|1 year ago

+1 for vanilla code without LangChain.

hbamoria|1 year ago

I believe you're looking for notebooks w/o LangChain. We plan to publish them in the next few days :)

imworkingrn|1 year ago

What's wrong with LangChain?

chompychop|1 year ago

Huh? All of their notebooks use LangChain.