Ask HN: Is RAG the Future of LLMs?
125 points | Gooblebrai | 1 year ago
What do you think? Are there any other alternatives or solutions in sight?
gandalfgeek | 1 year ago
The latest connotation of RAG includes mixing in real-time data from tools or RPC calls. E.g. getting data specific to the user issuing the query (their orders, history etc) and adding that to the context.
So will very large context windows (1M tokens!) "kill RAG"?
- at the simple end of the app complexity spectrum: when you're spinning up a prototype or your "corpus" is not very large, yes-- you can skip the complexity of RAG and just dump everything into the window.
- but there are always more complex use-cases that will want to shape the answer by limiting what they put into the context window.
- cost: filling up a significant fraction of a 1M window is expensive, both in terms of money and latency. So at scale, you'll want to filter down to the relevant info (i.e. do RAG) rather than indiscriminately dump everything into the window.
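The cost point above can be made concrete with a back-of-the-envelope sketch. The per-token price and token counts here are made-up placeholders, not any provider's actual rates:

```python
# Back-of-the-envelope comparison: dumping a whole corpus into a
# 1M-token context vs. retrieving only the top-k relevant chunks.
# The price below is a hypothetical placeholder, not a real rate.

PRICE_PER_1K_INPUT_TOKENS = 0.01  # hypothetical

def prompt_cost(num_tokens: int) -> float:
    """Dollar cost of sending num_tokens of input at the toy rate."""
    return num_tokens / 1000 * PRICE_PER_1K_INPUT_TOKENS

corpus_tokens = 900_000      # "dump everything" approach
rag_tokens = 8 * 500         # top-8 retrieved chunks of ~500 tokens each

full_context = prompt_cost(corpus_tokens)
rag_context = prompt_cost(rag_tokens)

print(f"full context: ${full_context:.2f}/query")
print(f"RAG context:  ${rag_context:.2f}/query")
print(f"ratio: {full_context / rag_context:.0f}x")
```

At these (invented) numbers the indiscriminate dump costs a couple of hundred times more per query, and the gap only grows with query volume.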
waldrews | 1 year ago
What needs to happen is a way to cheaply suspend and rehydrate the memory state of the forward pass after you've fed it a lot of tokens.
That would be a sort of light-weight/flexible/easily modifiable/versionable/real-time-editable alternative to fine tuning.
It's readily doable with the open-weights LLMs, but none of them (yet) have the context length to make it really worthwhile (some of the coding LLMs have long context windows, but that doesn't solve the 'knowledge base' scenario).
From a hosting perspective, if fine tunes are like VM's, such frozen overlays are like docker containers: many versions can live on the same server, sharing the base model and differing in the overlay layer.
(a startup idea? who wants to collaborate on a proof of concept?)
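The suspend/rehydrate idea can be sketched with a toy single-head attention in pure Python (no real model, identity projections instead of learned K/V): the knowledge base is fed once, the cache is snapshotted, and later sessions resume from the frozen snapshot instead of re-reading the corpus:

```python
# Toy illustration of "suspend and rehydrate" for a forward-pass KV cache.
# Not a real transformer: K and V are just the raw token vectors.

import copy
import math

def attend(query, keys, values):
    """Scaled dot-product attention for one query vector."""
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(len(query))
              for key in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]

class KVCache:
    def __init__(self):
        self.keys, self.values = [], []

    def feed(self, token_vec):
        # A real model would append learned projections of the token;
        # identity projections keep the toy self-contained.
        self.keys.append(token_vec)
        self.values.append(token_vec)

    def snapshot(self):
        return copy.deepcopy(self)

corpus = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

# The expensive pass over the corpus happens once, then gets frozen.
cache = KVCache()
for tok in corpus:
    cache.feed(tok)
frozen = cache.snapshot()

# Two sessions rehydrate the same snapshot and diverge independently,
# like containers sharing a base image.
session_a = frozen.snapshot()
session_a.feed([0.9, 0.1])
session_b = frozen.snapshot()
session_b.feed([0.1, 0.9])

out_a = attend([1.0, 0.0], session_a.keys, session_a.values)
out_b = attend([1.0, 0.0], session_b.keys, session_b.values)
print(out_a, out_b)
```

The real engineering problem is that production KV caches are gigabytes per long context, so the serialization, storage, and GPU-load cost of the snapshot is what decides whether this beats re-encoding.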
gavmor | 1 year ago
One of these will remain true until every person has their own pet model which is fine-tuned, on keyup, on all public data and their own personal data. Still, I struggle to imagine something heinously parametric (like regional weather on some arbitrary date) fitting into a transformer.
Edit: I can imagine every user getting a LoRA.
2099miles | 1 year ago
Among other things, because it's way too expensive; narrowing your scope cuts huge costs and isn't hard to do at a high level.
mark_l_watson | 1 year ago
I still think LLMs are the best AI tech/tools since I started getting paid to be an AI practitioner in 1982, but that is a low bar of achievement given that some forms of Symbolic AI failed to ever scale to solve real problems.
cl42 | 1 year ago
Since you asked about alternatives...
(a) "World models" where LLMs structure information into code, structured data, etc. and query those models will likely be a thing. AlphaGeometry uses this[1], and people have tried to abstract this in different ways[2].
(b) Depending on how you define RAG, knowledge graphs could be a form of RAG or an alternative to it. Companies like Elemental Cognition[3] are building distinct alternatives to RAG that use such graphs and give LLMs the ability to run queries on them. Another approach is to build "fact databases", where you structure observations about the world into standalone concepts/ideas/observations and reference those[4]. Again, similar to RAG but not quite RAG as we know it today.
[1] https://deepmind.google/discover/blog/alphageometry-an-olymp...
[2] https://arxiv.org/abs/2306.12672
[3] https://ec.ai/
[4] https://emergingtrajectories.com/
supreetgupta | 1 year ago
Try it out: https://github.com/truefoundry/cognita
darkteflon | 1 year ago
That’s RAG. Doesn’t matter that you didn’t use vectors or knowledge graphs or FTS or what have you.
Then the jump from “this whole document” to “well actually I only need this particular bit” puts you immediately into the territory of needing some sort of semantic map of the document.
I don’t think it makes conceptual sense to think about using LLMs without some sort of domain relevance function.
mif | 1 year ago
From the video in this IBM post [0], I understand that it is a way for the LLM to check what its source and latest date of information are. Based on that, it could, in principle, say "I don't know" instead of "hallucinating" an answer. RAG is a way to implement this feature for LLMs.
[0] https://research.ibm.com/blog/retrieval-augmented-generation...
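That refusal behavior can be sketched in a few lines. The word-overlap scorer and the threshold below are toy stand-ins for real embedding similarity; the point is only the control flow: if nothing retrieved is relevant enough, answer "I don't know" instead of letting the model guess:

```python
# Sketch of retrieval-gated refusal: answer only when a sufficiently
# relevant document exists. Overlap scoring is a toy stand-in for
# embedding similarity; the threshold is arbitrary.

def overlap_score(question: str, doc: str) -> float:
    """Fraction of question words that also appear in the document."""
    q, d = set(question.lower().split()), set(doc.lower().split())
    return len(q & d) / len(q) if q else 0.0

def answer(question: str, docs: list[str], threshold: float = 0.5) -> str:
    best = max(docs, key=lambda d: overlap_score(question, d), default=None)
    if best is None or overlap_score(question, best) < threshold:
        return "I don't know."
    # In a real system, `best` would be pasted into the LLM prompt here.
    return f"Based on: {best}"

docs = ["the warranty period is two years from purchase"]
print(answer("how long is the warranty period", docs))
print(answer("what is the capital of mars", docs))
```

Note that this only moves the problem: the model can still hallucinate *from* a relevant document, so the gate reduces, rather than eliminates, made-up answers.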
simonw | 1 year ago
The art of implementing RAG is deciding what text should be pasted into the prompt in order to get the best possible results.
A popular way to implement RAG is using similarity search via vector search indexes against embeddings (which I explained at length here: https://simonwillison.net/2023/Oct/23/embeddings/). The idea is to find the content that is semantically most similar to the user's question (or the likely answer to their question) and include extracts from that in the prompt.
But you don't actually need vector indexes or embeddings at all to implement RAG.
Another approach is to take the user's question, extract some search terms from it (often by asking an LLM to invent some searches relating to the question), run those searches against a regular full-text search engine and then paste results from those searches back into the prompt.
Bing, Perplexity, Google Gemini are all examples of systems that use this trick.
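That full-text-search flavor of RAG can be sketched without any vectors at all. The stopword-based term extraction below is a toy placeholder for the "ask an LLM to invent some searches" step, and the inverted index stands in for a real search engine:

```python
# Minimal sketch of full-text-search RAG: derive search terms from the
# question, look them up in an inverted index, paste the hits into the
# prompt. No embeddings or vector indexes involved.

from collections import defaultdict

documents = {
    "doc1": "pelicans are large water birds with enormous beaks",
    "doc2": "the stock market closed higher on tuesday",
}

# Inverted index: term -> set of document ids containing it.
index = defaultdict(set)
for doc_id, text in documents.items():
    for term in text.split():
        index[term].add(doc_id)

def extract_terms(question: str) -> list[str]:
    # Placeholder: a production system would prompt an LLM to
    # generate search queries for the question.
    stopwords = {"what", "do", "a", "the", "is", "are", "look", "like"}
    return [w for w in question.lower().split() if w not in stopwords]

def build_prompt(question: str) -> str:
    hits = set()
    for term in extract_terms(question):
        hits |= index.get(term, set())
    context = "\n".join(documents[d] for d in sorted(hits))
    return f"Context:\n{context}\n\nQuestion: {question}"

print(build_prompt("what do pelicans look like"))
```

The resulting prompt contains only the matching document, which is the whole trick: the "retrieval" step decides what the model gets to see.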
rldjbpin | 1 year ago
i am sure there can be newer ways to do prompt injection in an elegant way, but for the most part the llm is either summarizing the injected prompt or regurgitating it.
if the output is satisfactory, it is still more convenient than writing custom rules for answers for each kind of question you want to address.
spencerchubb | 1 year ago
I think LLM context is going to be like cache levels. The first level is small but super fast (like working memory). The next level is larger but slower, and so on.
RAG is basically a bad version of attention mechanisms. RAG is used to focus your attention on relevant documents. The problem is that RAG systems are not trained to minimize loss; they just use a similarity score.
Obligatory note that I could be wrong and it's just my armchair opinion
simonw | 1 year ago
A 100 million token context that takes an hour to start returning an answer to a prompt isn't very useful for most things.
As long as there is a relationship between the length of the context and the time it takes to produce an output, there will be a reason to be selective about what goes into that context - aka a reason to use RAG techniques.
ttul | 1 year ago
“This work introduces an efficient method to scale Transformer-based Large Language Models (LLMs) to infinitely long inputs with bounded memory and computation”
darby_eight | 1 year ago
I'd assume "large enough" context is the actual goal here, not "virtually infinite".
2099miles | 1 year ago
RAG is not bad attention; RAG is for when the user doesn't know what context to give the LLM.
simonw | 1 year ago
Anything that can do RAG is, by definition, a system that wraps an LLM with additional code that performs the retrieval.
It's the difference between ChatGPT (software that wraps a model and adds extra features such as tool usage, Code Interpreter, RAG lookup via Bing, etc.) and GPT-4 Turbo (a model).
teleforce | 1 year ago
Transformer Memory as a Differentiable Search Index:
https://arxiv.org/abs/2202.06991
machinelearning | 1 year ago
Both waste compute because you have to re-encode things as text each time and RAG needs a lot of heuristics + a separate embedding model.
Instead, it makes a lot more sense to pre-compute KV for each document, then compute a query for each request, surfacing values only when the attention score is high enough.
The challenge here is to encode global position information in the surfaced values and to get them to work with generation. I suspect it can't be done out of the box, but it will work with training.
This approach has echoes of both infinite context length and RAG but is an intermediate method that can be parallelized and is more efficient than either one.
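A toy sketch of that retrieval-as-attention idea: keys and values are precomputed once per document, and at query time a value is surfaced only when its attention score against the query clears a threshold. The hand-picked 2-d vectors and the threshold are stand-ins for learned projections:

```python
# Sketch of thresholded attention over precomputed per-document KV.
# Vectors and threshold are toy stand-ins for learned projections.

import math

# Precomputed offline, once per document -- no re-encoding text per query.
doc_kv = {
    "refund-policy": ([0.9, 0.1], [1.0, 0.0]),  # (key, value)
    "release-notes": ([0.1, 0.9], [0.0, 1.0]),
}

def surfaced_values(query, kv_store, threshold=0.5):
    """Return values whose scaled attention score clears the threshold."""
    results = {}
    for doc_id, (key, value) in kv_store.items():
        score = sum(q * k for q, k in zip(query, key)) / math.sqrt(len(query))
        if score >= threshold:
            results[doc_id] = (score, value)
    return results

query = [1.0, 0.2]  # would be a learned projection of the user's request
print(surfaced_values(query, doc_kv))
```

The thresholding is what makes it intermediate between RAG and full attention: low-scoring documents contribute nothing, so compute scales with the relevant subset rather than the whole corpus.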
nimish | 1 year ago
It's 1000x more efficient to give it a look-aside buffer of info than to try to teach it ab initio.
Why do more work when the data is already there?
cjbprime | 1 year ago
So you'd still want to use RAG as a performance optimization, even though today it's being used as more of a "there is no other way to supply enough of your own data to the LLM" must-have.
nl | 1 year ago
Longer term it gets more interesting.
Assuming we can solve long (approaching infinite) context, and solve the issues with reasoning over long context that LangChain correctly identified[1], then it becomes a cost and performance (speed) issue.
It is currently very very expensive to run a full scan of all knowledge for every inference call.
And there are good reasons why databases use indexes instead of table scans (ie, performance).
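The database analogy in miniature (toy data, not a real database): both paths return the same rows, but the index, built once up front, answers each query without touching every row:

```python
# Full table scan vs. index lookup on toy data. Both return the same
# rows; the index just does far less work per query.

rows = [{"id": i, "color": "red" if i % 100 == 0 else "blue"}
        for i in range(10_000)]

# Full table scan: O(n) comparisons for every query.
def scan(rows, color):
    return [r["id"] for r in rows if r["color"] == color]

# Index built once up front; each later lookup is a dict access.
index = {}
for r in rows:
    index.setdefault(r["color"], []).append(r["id"])

def lookup(index, color):
    return index.get(color, [])

assert scan(rows, "red") == lookup(index, "red")
print(len(lookup(index, "red")), "red rows found without scanning")
```

RAG's embedding index plays the same role for an LLM's "knowledge": pay an indexing cost once so every inference call doesn't pay a full-corpus scan.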
But maybe we find a route forward towards adaptive compute over the next two years. Then we can use low compute to find items of interest in the infinite context window, and then use high compute to reason over them. Maybe this could provide a way forward on the cost issues at least.
Performance is going to remain an issue. It's not clear to me how solvable that is (sure you can imagine ways it could be parallelized but it seems likely there will be a cost penalty on planning that)
[1] https://blog.langchain.dev/multi-needle-in-a-haystack/
0x008 | 1 year ago
Instead of computing every token every time on the whole context, we can grab a cache to take a shortcut. We do the same in software development all the time. Of course it's a performance issue.
sc077y | 1 year ago
The only issue right now is cost. You can bet that GPU performance will double every year, or even every 6 months according to Elon. RAG addresses cost today as well by only retrieving relevant context; once LLMs get cheaper and context windows widen, which they will, RAG will become easier, dare I say trivial.
I would argue RAG is important today on its own and as a grounding, no pun intended, for agent workflows.
0x008 | 1 year ago
We cannot simply state that at some point in time RAG will not be necessary. Like everything in the computer science world, it will always depend on our data size and the resource constraints we have.
Unless of course we can process a corpus the size of the whole internet in <1 second. However, I doubt this can be achieved in the next 20 years.
waldrews | 1 year ago
RAG can augment the LLM with specific knowledge, which may make it more likely to give factually correct answers in those domains, but it is mostly orthogonal to the hallucination problem (except to the extent that LLMs hallucinate when asked questions on a subject they don't know).
zamalek | 1 year ago
It is "search and summarize." It is not "glean new conclusions." That being said, "search and summarize" is probably good for 80%.
LoRA is an improvement, but I have seen benchmarks showing that it struggles to make as deep inferences as regular training does.
There isn't a one-size-fits-all... yet.