Ask HN: Is RAG the Future of LLMs?
125 points | Gooblebrai | 1 year ago
What do you think? Are there any other alternatives or solutions in sight?
gandalfgeek | 1 year ago
The latest connotation of RAG includes mixing in real-time data from tools or RPC calls. E.g. getting data specific to the user issuing the query (their orders, history etc) and adding that to the context.
So will very large context windows (1M tokens!) "kill RAG"?
- at the simple end of the app complexity spectrum: when you're spinning up a prototype or your "corpus" is not very large, yes-- you can skip the complexity of RAG and just dump everything into the window.
- but there are always more complex use-cases that will want to shape the answer by limiting what they put into the context window.
- cost: filling up a significant fraction of a 1M window is expensive, both in terms of money and latency. So at scale, you'll want to filter down to the relevant info (i.e. do RAG) rather than indiscriminately dump everything into the window.
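The cost point above can be made concrete with a back-of-the-envelope sketch. The per-token price and token counts here are made-up placeholders, not any provider's actual rates:

```python
# Back-of-the-envelope comparison: dumping a whole corpus into a
# 1M-token context vs. retrieving only the top-k relevant chunks.
# The price below is a hypothetical placeholder, not a real rate.

PRICE_PER_1K_INPUT_TOKENS = 0.01  # hypothetical

def prompt_cost(num_tokens: int) -> float:
    """Dollar cost of sending num_tokens of input at the toy rate."""
    return num_tokens / 1000 * PRICE_PER_1K_INPUT_TOKENS

corpus_tokens = 900_000      # "dump everything" approach
rag_tokens = 8 * 500         # top-8 retrieved chunks of ~500 tokens each

full_context = prompt_cost(corpus_tokens)
rag_context = prompt_cost(rag_tokens)

print(f"full context: ${full_context:.2f}/query")
print(f"RAG context:  ${rag_context:.2f}/query")
print(f"ratio: {full_context / rag_context:.0f}x")
```

At these (invented) numbers the indiscriminate dump costs a couple of hundred times more per query, and the gap only grows with query volume.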
waldrews | 1 year ago
What needs to happen is a way to cheaply suspend and rehydrate the memory state of the forward pass after you've fed it a lot of tokens.
That would be a sort of light-weight/flexible/easily modifiable/versionable/real-time-editable alternative to fine tuning.
It's readily doable with the open-weights LLMs, but none of them (yet) have the context length to make it really worthwhile (some of the coding LLMs have long context windows, but that doesn't solve the 'knowledge base' scenario).
From a hosting perspective, if fine tunes are like VM's, such frozen overlays are like docker containers: many versions can live on the same server, sharing the base model and differing in the overlay layer.
(a startup idea? who wants to collaborate on a proof of concept?)
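The suspend/rehydrate idea can be sketched with a toy single-head attention in pure Python (no real model, identity projections instead of learned K/V): the knowledge base is fed once, the cache is snapshotted, and later sessions resume from the frozen snapshot instead of re-reading the corpus:

```python
# Toy illustration of "suspend and rehydrate" for a forward-pass KV cache.
# Not a real transformer: K and V are just the raw token vectors.

import copy
import math

def attend(query, keys, values):
    """Scaled dot-product attention for one query vector."""
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(len(query))
              for key in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]

class KVCache:
    def __init__(self):
        self.keys, self.values = [], []

    def feed(self, token_vec):
        # A real model would append learned projections of the token;
        # identity projections keep the toy self-contained.
        self.keys.append(token_vec)
        self.values.append(token_vec)

    def snapshot(self):
        return copy.deepcopy(self)

corpus = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

# The expensive pass over the corpus happens once, then gets frozen.
cache = KVCache()
for tok in corpus:
    cache.feed(tok)
frozen = cache.snapshot()

# Two sessions rehydrate the same snapshot and diverge independently,
# like containers sharing a base image.
session_a = frozen.snapshot()
session_a.feed([0.9, 0.1])
session_b = frozen.snapshot()
session_b.feed([0.1, 0.9])

out_a = attend([1.0, 0.0], session_a.keys, session_a.values)
out_b = attend([1.0, 0.0], session_b.keys, session_b.values)
print(out_a, out_b)
```

The real engineering problem is that production KV caches are gigabytes per long context, so the serialization, storage, and GPU-load cost of the snapshot is what decides whether this beats re-encoding.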
gavmor | 1 year ago
One of these will remain true until every person has their own pet model which is fine-tuned, on keyup, on all public data and their own personal data. Still, I struggle to imagine something heinously parametric (like regional weather on some arbitrary date) fitting into a transformer.
Edit: I can imagine every user getting a LoRA.
2099miles | 1 year ago
Among other things, because it's way too expensive; narrowing your scope cuts huge costs and isn't hard to do at a high level.
mark_l_watson | 1 year ago
I still think LLMs are the best AI tech/tools since I started getting paid to be an AI practitioner in 1982, but that is a low bar of achievement given that some forms of Symbolic AI failed to ever scale to solve real problems.
cl42 | 1 year ago
Since you asked about alternatives...
(a) "World models" where LLMs structure information into code, structured data, etc. and query those models will likely be a thing. AlphaGeometry uses this[1], and people have tried to abstract this in different ways[2].
(b) Depending on how you define RAG, knowledge graphs could be a form of RAG or an alternative to it. Companies like Elemental Cognition[3] are building distinct alternatives to RAG that use such graphs and give LLMs the ability to run queries on them. Another approach is to build "fact databases", where you structure observations about the world into standalone concepts/ideas/observations and reference those[4]. Again, similar to RAG but not quite RAG as we know it today.
[1] https://deepmind.google/discover/blog/alphageometry-an-olymp...
[2] https://arxiv.org/abs/2306.12672
[3] https://ec.ai/
[4] https://emergingtrajectories.com/
supreetgupta | 1 year ago
Try it out: https://github.com/truefoundry/cognita
darkteflon | 1 year ago
That’s RAG. Doesn’t matter that you didn’t use vectors or knowledge graphs or FTS or what have you.
Then the jump from “this whole document” to “well actually I only need this particular bit” puts you immediately into the territory of needing some sort of semantic map of the document.
I don’t think it makes conceptual sense to think about using LLMs without some sort of domain relevance function.
mif | 1 year ago
From the video in this IBM post [0], I understand that it is a way for the LLM to check what its source and latest date of information are. Based on that, it could, in principle, say "I don't know" instead of "hallucinating" an answer. RAG is a way to implement this feature for LLMs.
[0] https://research.ibm.com/blog/retrieval-augmented-generation...
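That refusal behavior can be sketched in a few lines. The word-overlap scorer and the threshold below are toy stand-ins for real embedding similarity; the point is only the control flow: if nothing retrieved is relevant enough, answer "I don't know" instead of letting the model guess:

```python
# Sketch of retrieval-gated refusal: answer only when a sufficiently
# relevant document exists. Overlap scoring is a toy stand-in for
# embedding similarity; the threshold is arbitrary.

def overlap_score(question: str, doc: str) -> float:
    """Fraction of question words that also appear in the document."""
    q, d = set(question.lower().split()), set(doc.lower().split())
    return len(q & d) / len(q) if q else 0.0

def answer(question: str, docs: list[str], threshold: float = 0.5) -> str:
    best = max(docs, key=lambda d: overlap_score(question, d), default=None)
    if best is None or overlap_score(question, best) < threshold:
        return "I don't know."
    # In a real system, `best` would be pasted into the LLM prompt here.
    return f"Based on: {best}"

docs = ["the warranty period is two years from purchase"]
print(answer("how long is the warranty period", docs))
print(answer("what is the capital of mars", docs))
```

Note that this only moves the problem: the model can still hallucinate *from* a relevant document, so the gate reduces, rather than eliminates, made-up answers.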
simonw | 1 year ago
The art of implementing RAG is deciding what text should be pasted into the prompt in order to get the best possible results.
A popular way to implement RAG is using similarity search via vector search indexes against embeddings (which I explained at length here: https://simonwillison.net/2023/Oct/23/embeddings/). The idea is to find the content that is semantically most similar to the user's question (or the likely answer to their question) and include extracts from that in the prompt.
But you don't actually need vector indexes or embeddings at all to implement RAG.
Another approach is to take the user's question, extract some search terms from it (often by asking an LLM to invent some searches relating to the question), run those searches against a regular full-text search engine and then paste results from those searches back into the prompt.
Bing, Perplexity, Google Gemini are all examples of systems that use this trick.
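That full-text-search flavor of RAG can be sketched without any vectors at all. The stopword-based term extraction below is a toy placeholder for the "ask an LLM to invent some searches" step, and the inverted index stands in for a real search engine:

```python
# Minimal sketch of full-text-search RAG: derive search terms from the
# question, look them up in an inverted index, paste the hits into the
# prompt. No embeddings or vector indexes involved.

from collections import defaultdict

documents = {
    "doc1": "pelicans are large water birds with enormous beaks",
    "doc2": "the stock market closed higher on tuesday",
}

# Inverted index: term -> set of document ids containing it.
index = defaultdict(set)
for doc_id, text in documents.items():
    for term in text.split():
        index[term].add(doc_id)

def extract_terms(question: str) -> list[str]:
    # Placeholder: a production system would prompt an LLM to
    # generate search queries for the question.
    stopwords = {"what", "do", "a", "the", "is", "are", "look", "like"}
    return [w for w in question.lower().split() if w not in stopwords]

def build_prompt(question: str) -> str:
    hits = set()
    for term in extract_terms(question):
        hits |= index.get(term, set())
    context = "\n".join(documents[d] for d in sorted(hits))
    return f"Context:\n{context}\n\nQuestion: {question}"

print(build_prompt("what do pelicans look like"))
```

The resulting prompt contains only the matching document, which is the whole trick: the "retrieval" step decides what the model gets to see.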
rldjbpin | 1 year ago
i am sure there can be newer ways to do prompt injection in an elegant way, but for the most part the llm is either summarizing the injected prompt or regurgitating it.
if the output is satisfactory, it is still more convenient than writing custom rules for answers for each kind of question you want to address.
spencerchubb | 1 year ago
I think LLM context is going to be like cache levels. The first level is small but super fast (like working memory). The next level is larger but slower, and so on.
RAG is basically a bad version of attention mechanisms. RAG is used to focus your attention on relevant documents. The problem is that RAG systems are not trained to minimize loss; they just use a similarity score.
Obligatory note that I could be wrong and it's just my armchair opinion
simonw | 1 year ago
A 100 million token context that takes an hour to start returning an answer to a prompt isn't very useful for most things.
As long as there is a relationship between the length of the context and the time it takes to produce an output, there will be a reason to be selective about what goes into that context - aka a reason to use RAG techniques.
ttul | 1 year ago
“This work introduces an efficient method to scale Transformer-based Large Language Models (LLMs) to infinitely long inputs with bounded memory and computation”
darby_eight | 1 year ago
I'd assume "large enough" context is the actual goal here, not "virtually infinite".
2099miles | 1 year ago
RAG is not bad attention; RAG is for when the user doesn't know what context to give the LLM.
simonw | 1 year ago
Anything that can do RAG is, by definition, a system that wraps an LLM with additional code that performs the retrieval.
It's the difference between ChatGPT (software that wraps a model and adds extra features such as tool usage, Code Interpreter, RAG lookup via Bing, etc.) and GPT-4 Turbo (a model).
teleforce | 1 year ago
Transformer Memory as a Differentiable Search Index:
https://arxiv.org/abs/2202.06991
machinelearning | 1 year ago
Both waste compute because you have to re-encode things as text each time and RAG needs a lot of heuristics + a separate embedding model.
Instead, it makes a lot more sense to pre-compute KV for each document, then compute a query for each request, surfacing values only when the attention score is high enough.
The challenge here is to encode global position information in the surfaced values and to get them to work with generation. I suspect it can't be done out of the box, but it will work with training.
This approach has echoes of both infinite context length and RAG but is an intermediate method that can be parallelized and is more efficient than either one.
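A toy sketch of that retrieval-as-attention idea: keys and values are precomputed once per document, and at query time a value is surfaced only when its attention score against the query clears a threshold. The hand-picked 2-d vectors and the threshold are stand-ins for learned projections:

```python
# Sketch of thresholded attention over precomputed per-document KV.
# Vectors and threshold are toy stand-ins for learned projections.

import math

# Precomputed offline, once per document -- no re-encoding text per query.
doc_kv = {
    "refund-policy": ([0.9, 0.1], [1.0, 0.0]),  # (key, value)
    "release-notes": ([0.1, 0.9], [0.0, 1.0]),
}

def surfaced_values(query, kv_store, threshold=0.5):
    """Return values whose scaled attention score clears the threshold."""
    results = {}
    for doc_id, (key, value) in kv_store.items():
        score = sum(q * k for q, k in zip(query, key)) / math.sqrt(len(query))
        if score >= threshold:
            results[doc_id] = (score, value)
    return results

query = [1.0, 0.2]  # would be a learned projection of the user's request
print(surfaced_values(query, doc_kv))
```

The thresholding is what makes it intermediate between RAG and full attention: low-scoring documents contribute nothing, so compute scales with the relevant subset rather than the whole corpus.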
nimish | 1 year ago
It's 1000x more efficient to give it a look-aside buffer of info than to try to teach it ab initio.
Why do more work when the data is already there?
cjbprime | 1 year ago
So you'd still want to use RAG as a performance optimization, even though today it's being used as more of a "there is no other way to supply enough of your own data to the LLM" must-have.
nl | 1 year ago
Longer term it gets more interesting.
Assuming we can solve long (approaching infinite) context, and solve the issues with reasoning over long context that LangChain correctly identified[1], then it becomes a cost and performance (speed) issue.
It is currently very very expensive to run a full scan of all knowledge for every inference call.
And there are good reasons why databases use indexes instead of table scans (ie, performance).
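The database analogy in miniature (toy data, not a real database): both paths return the same rows, but the index, built once up front, answers each query without touching every row:

```python
# Full table scan vs. index lookup on toy data. Both return the same
# rows; the index just does far less work per query.

rows = [{"id": i, "color": "red" if i % 100 == 0 else "blue"}
        for i in range(10_000)]

# Full table scan: O(n) comparisons for every query.
def scan(rows, color):
    return [r["id"] for r in rows if r["color"] == color]

# Index built once up front; each later lookup is a dict access.
index = {}
for r in rows:
    index.setdefault(r["color"], []).append(r["id"])

def lookup(index, color):
    return index.get(color, [])

assert scan(rows, "red") == lookup(index, "red")
print(len(lookup(index, "red")), "red rows found without scanning")
```

RAG's embedding index plays the same role for an LLM's "knowledge": pay an indexing cost once so every inference call doesn't pay a full-corpus scan.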
But maybe we find a route forward towards adaptive compute over the next two years. Then we can use low compute to find items of interest in the infinite context window, and then use high compute to reason over them. Maybe this could provide a way forward on the cost issues at least.
Performance is going to remain an issue. It's not clear to me how solvable that is (sure you can imagine ways it could be parallelized but it seems likely there will be a cost penalty on planning that)
[1] https://blog.langchain.dev/multi-needle-in-a-haystack/
0x008 | 1 year ago
Instead of computing every token every time on the whole context, we can grab a cache to take a shortcut. We do the same in software development all the time. Of course it's a performance issue.
sc077y | 1 year ago
The only issue right now is cost. You can bet that GPU performance will double every year, or even every 6 months according to Elon. RAG addresses cost today as well by only retrieving relevant context; once LLMs get cheaper and context windows widen, which they will, RAG will become easier, dare I say trivial.
I would argue RAG is important today on its own and as a grounding, no pun intended, for agent workflows.
0x008 | 1 year ago
We cannot simply state that at some point in time RAG will not be necessary. Like everything in the computer science world, it will always depend on our data size and the resource constraints we have.
Unless of course we can process a corpus the size of the whole internet in <1 second. However, I doubt this can be achieved in the next 20 years.
waldrews | 1 year ago
RAG can augment the LLM with specific knowledge, which may make it more likely to give factually correct answers in those domains, but it is mostly orthogonal to the hallucination problem (except to the extent that LLMs hallucinate when asked questions on a subject they don't know).
zamalek | 1 year ago
It is "search and summarize." It is not "glean new conclusions." That being said, "search and summarize" is probably good for 80%.
LoRA is an improvement, but I have seen benchmarks showing that it struggles to make as deep inferences as regular training does.
There isn't a one-size-fits-all... yet.