Something I've been noodling on for a bit is that adding RAG - which should be a pattern most applications adopt - increases the accuracy and reliability of an LLM while also making the whole system harder to understand and wrangle without good diagnostic data.
Without RAG, you're really just iterating on a prompt that you might parameterize with a few pieces of data. When you see user activity or get a report that something isn't working right, you can iterate fairly easily to get a desired outcome.
When you add RAG, now the behavior of the whole system is dependent on decisions you make upstream. Sure, if that's just one single call to dot(a,b) on a set of vectors, that's easy enough. But what if your RAG system involves calling out to several different databases? What if it pulls in other components that are specific to a user and might be unique per-request, all in the name of improving accuracy? Little doubt that it will improve accuracy, but now you've got a more complex, perhaps very distributed system that you can't just pull down locally and start reproducing behavior with.
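The simple end of that spectrum - one dot(a, b) call over a set of vectors - really is just a few lines. A minimal sketch, with tiny hand-made vectors standing in for real embedding output:

```python
# Minimal sketch of the "one dot(a, b) call" retrieval case: score a query
# embedding against a few stored document embeddings and take the best match.
# The vectors here are tiny hand-made stand-ins for real embedding output.

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def top_k(query_vec, docs, k=2):
    # docs: list of (text, embedding) pairs; higher dot product = more similar
    # (assumes the embeddings are normalized, so dot product ~ cosine similarity)
    scored = sorted(docs, key=lambda d: dot(query_vec, d[1]), reverse=True)
    return [text for text, _ in scored[:k]]

docs = [
    ("refund policy", [0.9, 0.1, 0.0]),
    ("shipping times", [0.1, 0.9, 0.1]),
    ("api reference", [0.0, 0.2, 0.95]),
]

# The retrieved snippet(s) would then be pasted into the prompt as context.
print(top_k([0.85, 0.2, 0.05], docs, k=1))  # → ['refund policy']
```

The point of the comment stands, though: once retrieval spans several databases and per-user components, the behavior is no longer something you can reason about from a snippet like this.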
This is why I really do believe that a lot more developers are going to have to embrace tracing as a means to understand their applications. My understanding is that LlamaIndex already has a tracing model of sorts internally. Being able to get that data out as OTLP and correlated with the rest of an application would be rad.
You are absolutely right. I have built a handful of complex interconnected LLM calls, and the thing I am dying for is a simple way to inspect what is happening at each step. For instance, one chain has 5+ steps where data is fetched, transformed, sent to an LLM, and transformed again. I have a homegrown solution for seeing what's going on, but it is missing a lot of what I would want to see.
I've seen some attempts at providing RAG tracing as a service, but for my specific use case, where in some instances I am making 100+ LLM calls per chain, it isn't a fit quite yet.
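A homegrown chain tracer like the one described can be surprisingly small. This is a sketch only (not the OpenTelemetry API): each step is wrapped in a span that records its name, parent, duration, and a few attributes, so a multi-step chain can be inspected afterwards. All names and payloads here are hypothetical.

```python
# Minimal homegrown chain tracer (a sketch, not a real tracing SDK):
# each step in a chain is wrapped in a span recording name, parent,
# duration, and optional attributes for later inspection.
import time
from contextlib import contextmanager

class ChainTracer:
    def __init__(self):
        self.spans = []      # finished spans, in completion order
        self._stack = []     # names of currently open spans

    @contextmanager
    def span(self, name, **attrs):
        parent = self._stack[-1] if self._stack else None
        self._stack.append(name)
        start = time.perf_counter()
        try:
            yield
        finally:
            self._stack.pop()
            self.spans.append({
                "name": name,
                "parent": parent,
                "ms": (time.perf_counter() - start) * 1000,
                **attrs,
            })

tracer = ChainTracer()
with tracer.span("chain", user="u123"):
    with tracer.span("fetch", source="docs_db"):
        data = ["chunk-1", "chunk-2"]
    with tracer.span("transform", n_chunks=len(data)):
        prompt = "\n".join(data)
    with tracer.span("llm_call", model="hypothetical-model"):
        answer = "answer based on: " + prompt  # stand-in for a real LLM call

for s in tracer.spans:
    print(f'{s["parent"] or "-":>8} -> {s["name"]} ({s["ms"]:.2f} ms)')
```

Getting the same span data out in a standard format (e.g. OTLP) and correlated with the rest of the application is the part a homegrown version tends to miss.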
> increases accuracy and reliability of an LLM while also making the whole system harder to understand
Harder than what? If the alternative is a black box fine tuned model, RAG is obviously superior. There’s no easy way to audit why an LLM spits out nonsense, and the only way we know how to make it less nonsensical is to feed reasonable inputs.
So RAG replaces a poorly understood problem (zero-shot generation from decoder-only models) with a well-understood one and a poorly understood one (search retrieval followed by zero-shot generation from documents).
I think it’s a little disingenuous for all of these out of the box repos to tout their GitHub stars and claim to be “leading” in RAG. The first step is retrieval. People have been working on retrieval for decades and vector-based retrieval since before attention was all we needed.
That said, I’m glad that the prospect of text generation has forced people to become familiar with information retrieval. It’s a relatively new subject compared to the rest of CS and I really only learned about it in industry.
There's definitely going to be a lot of startups in and around this space.
I've been building a basic RAG system myself and just built a basic admin panel that displays these traces by dumping a lot of data into a Postgres table. But this type of observability and configurability, plus the ability to score results, get feedback loops, version changes, etc., will be really key.
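The "dump traces into a table" approach described above can be sketched like this, using sqlite3 here purely as a stand-in for Postgres (the table and column names are hypothetical). Each row is one step of one chain run, with a score column left open for later feedback:

```python
# Sketch of a trace-events table: one row per step per chain run, with a
# score column so results can be rated afterwards and fed back into a loop.
# sqlite3 is used here as a stand-in for Postgres.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE trace_events (
        run_id    TEXT,
        step      INTEGER,
        step_name TEXT,
        payload   TEXT,
        score     REAL      -- filled in later by human or automated feedback
    )
""")

def log_step(run_id, step, name, payload):
    conn.execute(
        "INSERT INTO trace_events (run_id, step, step_name, payload) "
        "VALUES (?, ?, ?, ?)",
        (run_id, step, name, payload),
    )

log_step("run-1", 0, "retrieve", "3 chunks from docs index")
log_step("run-1", 1, "llm_call", "prompt=1.2k tokens")

# Later: score a step once feedback comes in.
conn.execute("UPDATE trace_events SET score = 0.8 WHERE run_id = ? AND step = 1",
             ("run-1",))

rows = conn.execute(
    "SELECT step_name, score FROM trace_events WHERE run_id = ? ORDER BY step",
    ("run-1",),
).fetchall()
print(rows)  # → [('retrieve', None), ('llm_call', 0.8)]
```

An admin panel then just queries and renders these rows per run.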
I'll just add that there are use cases where the extra layer actually adds a nice element of control and visibility. Adjusting a prompt to a black box LLM can sometimes lead to unpredictable results, though that can be mitigated with a good test set to some extent.
This was a great listen that really resonated with the approach we've taken in building our code AI tool (Sourcegraph's Cody: https://cody.dev). We found that the biggest levers we had to drive improvement in the accuracy of Q&A and code generation quality was fetching relevant snippets of code and docs into the context window. So we pivoted from investigating more expensive, long-iteration-cycle updates at the model training level and invested more into code-specific information retrieval mechanisms (it helps that we spent the past 10 years building a code search and code intelligence engine). We've found RAG to be cheap, fast, and directly impactful compared to model-layer improvements (but as Jerry points out, still a very hard engineering problem).
Anyway, Jerry and LlamaIndex have been a huge source of learning and inspiration—please keep tweeting and publishing, Jerry!
I still don't understand why we have a different term (RAG) for something we've been doing for a while, i.e. using a vector db/embeddings coupled with an LLM to generate better answers. LlamaIndex (formerly GPTIndex) has been around for a while and the Pinecone boys have been talking about this for months. Is there a breakthrough in terms of the way the pipeline is set up here?
Because "RAG" is shorter than "use a vector db/embeddings coupled with LLM to generate better answers"
But the better answer: "retrieval augmented" has been used in papers since at least 2018 [1], the full "RAG" acronym since at least 2020 [2], and it was recently repopularized when the GPT3+ hype wave realized they were doing the same thing.
I think the use of this term "RAG" is counterproductive and almost makes me suspicious. Because it is used ambiguously and imprecisely, usually with the main aim being to suggest a level of expertise.
But all it really means is that you put some extra information in your prompt (from somewhere) that didn't come from the user. It's basically the first thing that anyone thinks of once they get the API call working.
People don't use it to only mean vector search, despite often assuming that is what it means.
What I have been leaning towards recently is not using vectors at all, or only using them when I have an extremely close vector match. Instead I give the API one or more functions for looking up information, something like an index into documentation or a query for some info from a database. But it's specific to the application.
I think usually the non-vector stuff is more effective.
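The non-vector approach described above might look like the following sketch: instead of similarity search, the application exposes a few specific lookup functions that the model can request via tool/function calling, and the app dispatches them. All function names and data here are hypothetical.

```python
# Sketch of function-based retrieval: the model is told about a few
# application-specific lookup functions and the app dispatches its requests.
# DOC_INDEX and USER_DB stand in for real documentation and a real database.

DOC_INDEX = {
    "billing": "Invoices are generated on the 1st of each month.",
    "auth": "API keys are passed in the Authorization header.",
}

USER_DB = {"u42": {"plan": "pro", "seats": 5}}

def lookup_docs(topic: str) -> str:
    # Acts like an index into documentation rather than a similarity search.
    return DOC_INDEX.get(topic, "no doc entry for that topic")

def lookup_user(user_id: str) -> dict:
    return USER_DB.get(user_id, {})

# These would be advertised to the model as callable tools; when the model
# asks for one by name, the application dispatches the call:
TOOLS = {"lookup_docs": lookup_docs, "lookup_user": lookup_user}

def dispatch(name, **kwargs):
    return TOOLS[name](**kwargs)

print(dispatch("lookup_docs", topic="auth"))
```

The lookups are exact and auditable, which is part of why this can beat fuzzy vector matches for application-specific questions.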
This is a subjective and semi-informed opinion, but I don't really see that RAG is "a hack". The value of LLMs isn't their ability to regurgitate information. Computers could do that already, with databases, or even flat files. As I see it, the value of LLMs is their ability to interact with human language and abstract concepts.
There are situations where you have some limited amount of relatively static data you want to add to your model. In those situations you'll get value from fine tuning the model.
But if you have a large set of data to add, or data that updates frequently, then a model that understands human language and basic logical reasoning hooked into a RAG makes much more sense. That strikes me as an elegant solution. I don't see that as "a hack" at all.
The thing I love about talking to all the founders we've chatted with so far is their intellectual honesty and pursuit of solving real problems that people have, not just creating ones out of thin air. RAG -is- a hack, but it's a very good hack, and findings like the Lost in the Middle paper and even learnings from finetuning so far make it such that RAG frameworks are here to stay.
Coincidentally I just listened to this episode a few hours ago on a long car drive! It was really good, though if you don't know what RAG is, you'll want to study up on that first (or just read the transcript on that page), as they don't really describe the concept in the episode itself. Enough intelligent insights popped up for me to keep rewinding certain bits for further thought, though.
Yea, I tried to anchor too but it didn't work :( All transcriptions are created with https://github.com/FanaHOVA/smol-podcaster; we think most people want to skim and read before listening, so we try to have high-quality ones for each episode.
RAG is much closer to how the human brain works than fine tuning a model is. Arguably, RAG points to the existence or the beginnings of AGI. The computer can take in new information and use it to inform its answer.
> Arguably, RAG points to the existence or the beginnings of AGI. The computer can take in new information and use it to inform its answer.
I disagree with that. The model isn't fetching information to augment its responses, nor does it have any notion of ground truth or accuracy. I would argue that RAG points at the inverse - we have these incredible probabilistic autocomplete machines but they're so incapable of being reliable that we have to force them to work in a context that we establish for them.
> RAG is much closer to how the human brain works than fine tuning a model is.
[citation needed]
Obviously, the human brain continuously integrates new material, but whether that's more like RAG, continuous fine-tuning, just having a really big context window, or just having a completely different model architecture where activation (or not) of pathways in inference also produces weight adjustments directly — or doesn't meaningfully work like any of those — is another question, where the answer is less obvious.
I think RAG (storing exact data losslessly with some sort of indexing to allow retrieval, and something hidden-prompt-like directing the use of that in populating an inference context) is a lot less likely than something that involves dynamically adjusting model weights. Which also explains forgetting, not just learning.
> RAG is much closer to how the human brain works than fine tuning a model is. Arguably, RAG points to the existence or the beginnings of AGI.
I see these two claims as orthogonal. AGI need not work like the human brain, and I’m not sure people’s brains have any lossless representation of memories like a RAG system would. People are infamously unreliable when it comes to recalling information during crimes for example.
However, it does seem like a RAG system is much more versatile than continuously training. Facts change all the time and it’s infeasible to retrain every time that happens.
They already had to rename from GPTIndex. If I am allowed to be a skeptic: pretty likely they do not want to go through a rename again, and so putting the current name down to coincidence might be an attempt to create a paper trail in support of their claims.
[1] https://arxiv.org/abs/1808.03430 [2] https://arxiv.org/abs/2005.11401
I was going to post https://www.latent.space/p/llamaindex#transcription as a shortcut to it but apparently that just confuses the on-page JS and it resets to #details instead.
Great episode!