
Meta Superintelligence Labs' first paper is about RAG

423 points| skadamat | 4 months ago |paddedinputs.substack.com

https://arxiv.org/abs/2509.01092

271 comments


ipsum2|4 months ago

This has nothing to do with superintelligence; it's just that the people who were working on the paper before the re-org happened to publish after the name change.

Though it is notable that, contrary to many predictions (on HN and Twitter) that Meta would stop publishing papers and become like other AI labs (e.g. OpenAI), they've continued their rapid pace of releasing papers AND open-source models.

pityJuke|4 months ago

What model(s) have Meta released since the Lab re-org?

Also, that wasn't based purely on hearsay; Zuck explicitly said:

> We believe the benefits of superintelligence should be shared with the world as broadly as possible. That said, superintelligence will raise novel safety concerns. We'll need to be rigorous about mitigating these risks and careful about what we choose to open source. Still, we believe that building a free society requires that we aim to empower people as much as possible. [0]

[0]: https://www.meta.com/superintelligence/

ekianjo|4 months ago

Open-weights models, not open source. And even their weights are under a specific license that's not as permissive as Apache 2.0.

RataNova|4 months ago

Still, I think the optics matter... the fact that Meta's still putting out technical work (and open sourcing it) after the restructure says a lot about where they want to position themselves

Zacharias030|4 months ago

Should be the top comment.

MSL is not only those few high profile hires.

godelski|4 months ago

It's kinda funny: Meta has long had some of the best people in the field, but left them untapped. I really think if they just took a step back, stopped being so metric-focused, and let their people freely explore, they'd be winning the AI race. But with this new team, I feel like Meta mostly hired the people who are really good at gaming the system. The people who care more about the money than the research.

A bit of this is true at every major lab. There's tons of untapped potential. But these organizations are very risk averse. I mean, why not continue with the strategy that got us to this point in the first place? Labs used to hire researchers and give them a lot of free rein. But those times ended, and AI progress slowed down too. Maybe if you want to get ahead you gotta stop thinking like everyone else.

Well Meta... you can "hold me hostage" for a lot cheaper than those guys. I'm sure this is true for hundreds of passionate ML researchers. I'd take a huge pay cut to have autonomy and resources. I know for a fact there are many working at Meta right now who would do the same. So maybe if you're going to throw money at the problem, diversify a bit and look back at what made SV what it is today and what made AI take leaps forward.

hamasho|4 months ago

My theory is that as more people compete, the top candidates become those who are best at gaming the system rather than those who are actually the best. Someone has probably studied this. My only evidence is job applications at GAFAM and Tinder, though.

contrarian1234|4 months ago

> Labs used to hire researchers and give them a lot of free reign.

I can't think of it ever really paying off. Bell Labs is the best example: amazing research that was unrelated to the core business of the parent company. Microsoft Research is another great one. Lots of interesting research that... got MS some nerd points? But it has materialized into very, very few actual products and revenue streams. Advancing AI research doesn't help Meta build any moats or revenue streams. It just progresses our collective knowledge.

On the "human progress" scale it's fantastic to put lots of smart people in a room and let them do their thing. But from a business perspective it seems to almost never pay off. Waiting on the irrational charity of business executives is probably not the best way to structure things.

I'd tell them to go become academics.. but all the academics I know are just busy herding their students and attending meetings

didip|4 months ago

I always wonder about that. Those $100M mathematicians... how can they have room to think under Meta's crushing IMPACT pressure?

RataNova|4 months ago

The money chase is real. You can kind of tell who's in it for the comp package vs. who'd be doing the same work on a laptop in their garage if that's what it took

zer0zzz|4 months ago

> I really think if they just took a step back and stop being so metric focused and let their people freely explore then they'd be win..

This is very true, and more than just in ai.

I think if they weren’t so metric focused they probably wouldn’t have hit so much bad publicity and scandal too.

bboygravity|4 months ago

AI progress has slowed down?! By what metric?

Quite the statement for anybody who follows developments (without excluding xAI).

rhetocj23|4 months ago

"Maybe if you want to get ahead you gotta stop thinking like everyone else"

Well for starters you need a leader who can rally the troops who "think(s) different" - something like a S Jobs.

That person doesnt seem to exist in the industry right now.

ProofHouse|4 months ago

winning the AI race? Meta? Oh that was a good one. Zuck is a follower not a leader. It is in his DNA

bobxmax|4 months ago

I thought Alex Wang was a very curious choice. There are so many foundational AI labs with interesting CEOs... I get that Wang is remarkable in his own right, but he basically just built MTurk and timed the bubble.

Doesn't really scream CEO of AGI to me.

mark_l_watson|4 months ago

A great idea, bypassing as much conversion as possible between vector space and natural language tokens. Reminds me of a discussion of having AI’s “talk” to each other using vector space.

There was an interesting quote “plain old BM25 from 1994 outperforms vector search on recall” and super relevant to what I did yesterday. I am trying to use small local models more often and yesterday I wrote Common Lisp code that uses a large corpus of text and a user query or prompt to construct a fairly concise one-shot prompt with select context from the text corpus. This is RAG, and I used both BM25 and vector embeddings matching. I added the code and an example as a new chapter in my CL book (link directly to new material: https://leanpub.com/lovinglisp/read#leanpub-auto-autocontext...) yesterday afternoon. BM25 is fast. This is new code, and I will certainly be experimenting more with it, but as-is it is useful when working with small local LLMs.
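The BM25 scoring the comment mentions is simple enough to sketch in a few lines. A toy Python version of Okapi BM25 (not the book's Common Lisp code; `k1` and `b` are the usual default parameters):

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Score each tokenized document in `docs` against `query_terms` with Okapi BM25."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N   # average document length
    df = Counter()                          # document frequency per term
    for d in docs:
        df.update(set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        s = 0.0
        for t in query_terms:
            if t not in tf:
                continue
            idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
            s += idf * tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores

docs = [["the", "cat", "sat"], ["dogs", "chase", "cats"], ["the", "stock", "market"]]
print(bm25_scores(["cat", "sat"], docs))  # first doc scores highest, others zero
```

Note it is purely lexical: "cats" in the second document does not match the query term "cat" at all, which is exactly the recall gap embeddings are supposed to close.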

schmorptron|4 months ago

One thing I don't get about the ever-recurring RAG discussions and hype men proclaiming "RAG is dead" is that people seem to be talking about wholly different things. My mental model is that what is called RAG can be either:

- a predefined document store / document-chunk store where every chunk gets a vector embedding, and a lookup decides what gets pulled into context, so you don't have to pull in whole classes of documents and fill it up

- the web-search-like features in LLM chat interfaces, where they do keyword search and pull relevant documents into context, but somehow only ephemerally, with the full documents not taking up context later in the thread (unsure about this, did I understand it right?)

With the new models with million-plus-token context windows, some were arguing that we can just throw whole books into the context non-ephemerally, but doesn't that significantly reduce the diversity of possible sources we can include at once if we hard-commit to everything staying in context forever? I guess it might help with consistency? But isn't the mechanism by which we decide what to keep in context still some kind of RAG, just with whole documents instead of only parts?

I'd be ecstatic if someone who really knows their stuff could clear this up for me.

kgeist|4 months ago

Technically, RAG is anything that augments generation with external search. However, it often has a narrower meaning: "uses a vector DB."

Throwing everything into one large context window is often impractical - it takes much more time to process, and many models struggle to find information accurately if too much is going on in the context window ("lost in the middle").

The "classic" RAG still has its place when you want low latency (or you're limited by VRAM) and the results are already good enough.
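In that narrow sense the retrieval step really is tiny. A toy sketch, with a hashed bag-of-words standing in for a real embedding model (names and the `embed` trick are mine, not any particular vector DB's API):

```python
import math

def embed(text, dim=64):
    """Toy embedding: hashed bag-of-words, L2-normalized.
    A real RAG system would use a trained embedding model here."""
    v = [0.0] * dim
    for word in text.lower().split():
        v[hash(word) % dim] += 1.0
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def top_k(query, chunks, k=2):
    """Return the k chunks whose embeddings are most similar to the query."""
    q = embed(query)
    scored = [(sum(a * b for a, b in zip(q, embed(c))), c) for c in chunks]
    return [c for _, c in sorted(scored, reverse=True)[:k]]

chunks = ["cats are mammals", "the stock market fell", "dogs are mammals too"]
print(top_k("cats mammals", chunks, k=1))  # the cat chunk should rank first
```

Everything up to "paste the winners into the prompt" is just this; the vector-DB part is an index that makes `top_k` fast over millions of chunks.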

impossiblefork|4 months ago

We can't throw in infinite things in the context though.

My impression is that GPT-5 gets confused, not quite right away, but after a couple of pages it has no idea. It doesn't take pages upon pages before it forgets things.

GistNoesis|4 months ago

The answer is adaptability.

In both cases for "Question Answering" it's about similarity search but there are two main orthogonal differences between RAG and Non-RAG :

-Knowing the question at the time of index building

-Higher order features : the ability to compare fetched documents with one another and refine the question

Non-RAG, aka a multi-layer (non-causal) transformer with infinite context, is the more generic version. It is fully differentiable, meaning you can use machine learning to learn how to Non-RAG better. Each layer of the transformer can use the previous layer to reason and refine the similarity search. (A causal transformer knows the question at the time it is fed the question, and can choose to focus its attention on different parts of the previously computed features of the provided documents, but it may benefit from having some reflection tokens, or better: from being given the question before being presented the documents, provided you've trained it to answer like that.)

RAG is an approximation of the generic case to make it faster and cheaper. It usually breaks end-to-end differentiability by using external tools, so if you want to use machine learning to learn how to RAG better, you will need some variant of reinforcement learning, which learns more slowly. RAG usually doesn't know the question at the time of index building, and documents are treated independently of each other, so there are no (automatic) higher-order features (embeddings are fixed).

A third common approximation is to feed the output of RAG into Non-RAG, to hopefully get the best of both worlds. You can learn the Non-RAG-given-RAG part with machine learning (if you train it on conversations where it used RAG), but the RAG part won't improve by itself.

Non-RAG needs to learn, so it needs a big training dataset; fortunately it can pick up question-answer pairs in an unsupervised fashion when you feed it the whole web, and you only need a small instruction-tuning and preference-optimization dataset to shape it to your needs. If performance isn't what you expect in a specific case, you can provide more specific examples and retrain the model until it gets it, and you get better performance for the case you were interested in. You can improve the best case, but it's hard to improve the worst case.

RAG gives you more control over what you feed it, but the content has to be structured. You can prevent worst cases more easily, but it's hard to improve the good case.

edanm|4 months ago

> My mental model is that what is called RAG can either be:

RAG is confusing, because if you look at the words making up the acronym, it seems like it could be either of the things you mentioned. But it originally referred to a specific technique of embeddings + vector search: this is the way it was used in the ML paper that coined the term, and this is the way most people in the industry actually use it.

It annoys me, because I think it should refer to all techniques of augmenting, but in practice it's often not used that way.

There are reasons that make the "embeddings" idea special: namely, it's a relatively new technique that fits LLMs very well, because it's semantic search, meaning it works on "the same input" as LLMs do, which is a free-text query. (As opposed to traditional lookups that work on keyword search or similar.)

As for whether RAG is dead - if you mean specifically vector-embeddings and semantic search, it's possible - because you could theoretically use other techniques for augmentation, e.g. an agent that understands a user question about a codebase and uses grep/find/etc to look for the information, or composes a search to search the internet for something. But it's definitely not going to die in that second sense of "we need some way to augment LLMs knowledge before text generation", that will probably always be relevant, as you say.

make3|4 months ago

No one is saying RAG is dead; you're never going to put the whole Internet in the context of the model, and the more you put in, the more expensive it is.

zem|4 months ago

this was really weird to read:

> But RAG is a very real world, practical topic for something as significant as a new lab’s first paper.

I would expect exactly the opposite - that a new lab would put out a few random papers that happen to be in areas their researchers were interested in and already working on, and once people had been working together a while and developed some synergy they would maybe come out with something really groundbreaking.

do people really view a "first paper" as something deeply significant and weighty? because that just seems like a good way to get bogged down in trying to second guess whether any given paper was good enough to be your all-important debut!

Al-Khwarizmi|4 months ago

As an academic I would expect the same as you, and no, to my knowledge "first paper" is meaningless, at least in academia. Most people's first paper is some small contribution to what their PhD supervisor is doing at the time, where the student tries their best at writing but it ends up so heavily edited that probably 90% of the final text comes from the supervisor :) So typically first papers don't define or represent a researcher. When you start you just don't have the experience to have a great idea and carry it through to a good paper.

Of course here we are talking about a lab, not an individual person, but still I haven't heard of first papers being considered special in any way, even for labs.

elyobo|4 months ago

Can we have a more informative, less clickbaity, title?

dang|4 months ago

What would a more informative, less clickbaity title be?

(preferably using representative language from the article)

smeeger|4 months ago

There should be a guideline to get rid of clickbait titles. It's an epidemic here.

jongjong|4 months ago

Interesting. All the developers I know who have tinkered with embeddings and vector similarity scoring were instantly hooked. The efficiency of computing the embeddings once and then reusing them as many times as needed, comparing the vectors with a cheap <30-line function, is extremely appealing. Not to mention the indexing capabilities that make it work at scale.

IMO vector embedding is the most important innovation in computing of the last decade. There's something magical about it. These people deserve some kind of prize. The idea that you can reduce almost any intricate concept including whole paragraphs to a fixed-size vector which encapsulates its meaning and proximity to other concepts across a large number of dimensions is pure genius.
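That "<30-line function" is essentially cosine similarity. A plain-Python sketch (production code would use numpy or a vector index, but this is the whole idea):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors, in [-1, 1].
    1 means same direction, 0 means unrelated, -1 means opposite."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Embed once, compare many times: only this cheap comparison runs per query.
print(cosine_similarity([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))  # same direction -> ~1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))            # orthogonal -> ~0.0
```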

_jayhack_|4 months ago

Vector embedding is not an invention of the last decade. Featurization in ML goes back to the 60s - even deep learning-based featurization is decades old at a minimum. Like everything else in ML this became much more useful with data and compute scale

liampulles|4 months ago

If you take the embedding for king, subtract the embedding for male, add the embedding for female, and lookup the closest embedding you get queen.

The fact that simple vector addition can encode the concepts of royalty and gender (among all sorts of others) is kind of magic to me.
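With hand-made toy vectors (real embeddings have hundreds of learned dimensions, and the arithmetic is only approximately this clean), the analogy looks like:

```python
# Hand-made 4-d vectors; dims are roughly [royalty, maleness, femaleness, person-ness].
vectors = {
    "king":     [0.9, 0.9, 0.1, 1.0],
    "queen":    [0.9, 0.1, 0.9, 1.0],
    "man":      [0.1, 0.9, 0.1, 1.0],
    "woman":    [0.1, 0.1, 0.9, 1.0],
    "princess": [0.7, 0.1, 0.9, 0.9],
    "person":   [0.1, 0.5, 0.5, 1.0],
}

def nearest(target, skip):
    """Word whose vector is closest (squared Euclidean) to `target`, excluding `skip`."""
    def dist(v):
        return sum((a - b) ** 2 for a, b in zip(v, target))
    return min((w for w in vectors if w not in skip), key=lambda w: dist(vectors[w]))

# king - man + woman
analogy = [k - m + w for k, m, w in zip(vectors["king"], vectors["man"], vectors["woman"])]
print(nearest(analogy, skip={"king", "man", "woman"}))  # -> queen
```

Subtracting "man" removes the maleness component while keeping royalty; adding "woman" puts femaleness back, landing near "queen".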

ekidd|4 months ago

Vector embeddings are slightly interesting because they come pre-trained with large amounts of data.

But similar ways to reduce huge numbers of dimensions to a much smaller set of "interesting" dimensions have been known for a long time.

Examples include principal component analysis / singular value decomposition, which was the first big breakthrough in face recognition (in the early 90s), and was also used in latent semantic indexing, the Netflix Prize, and a large pile of other things. And the underlying technique was invented in 1901.

Dimensionality reduction is cool, and vector embedding is definitely an interesting way to do it (at significant computational cost).

CuriouslyC|4 months ago

Vector embeddings are so overhyped. They're decent as a secondary signal, but they're expensive to compute and fragile. BM25-based solutions are more robust and have WAY lower latency, at the cost of some accuracy loss vs hybrid solutions. You can get the majority of the lift of hybrid solutions with ingest-time semantic expansion / reverse-HyDE-style input annotation plus a sparse-embedding BM25, at a fraction of the computational cost.

calf|4 months ago

The idea of reducing language to mere bits sounds, in general, like it would violate the Gödel/Turing theorems about computability.

Imnimo|4 months ago

I'm curious whether this is work that was specifically begun under the "superintelligence" umbrella, or if it's just that the people who were working on it had been shifted to the Superintelligence team by the time they wrote the paper. I would guess the former?

mountainriver|4 months ago

This was a very obvious next step, I played around with implementing something similar at one point.

In general we need to make it simpler for LLMs to take in different forms of embeddings. At least frameworks that simplify it.

yalogin|4 months ago

I am not surprised, because the culture at Meta is not at all, even in the slightest, to focus on science for its own sake. It's actively purged out of you. The focus is on metrics and how the bottom line is impacted. So this is in line with that.

georgeburdell|4 months ago

It’s not that simple. I worked at a supplier of Meta and they paid us large NREs to fund our exploratory work

rhetocj23|4 months ago

Yeah, and this problem is near impossible to fix once it has infested the culture of the firm.

alex1138|4 months ago

"People are using our service more!" turns out to be a horrible metric when they outright lie to you ("x has sent you a message!" when no message exists).

nmca|4 months ago

This is not work by any of the high profile new hires, in case folks are confused.

puttycat|4 months ago

Seems very incremental and very far from the pompous 'superintelligence' goal.

antonvs|4 months ago

It’s unlikely that the existing LLM architecture will evolve into anything that resembles superintelligence any more than it does already.

Which means that modifications to the architecture, and combining it with other components and approaches, are the next likely step. This paper fits that.

btilly|4 months ago

If you can collapse "retrieve this complex chunk when it is needed" into a single token, what else can you put into a token?

"Send this through the math coprocessor." "Validate against the checklist." "Call out to an agent for X." "Recheck against input stream Y." And so on.

Retrieval augmentation is only one of many uses for this. If this winds up with better integration with agents, it is very possible that the whole is more than the sum of its parts.

lukev|4 months ago

Think about it this way; they are encoding whole "thoughts" or "ideas" as single tokens.

It's effectively a multimodal model, which handles "concept" tokens alongside "language" tokens and "image" tokens.

A really big conceptual step, actually, IMO.

naasking|4 months ago

A 30 fold improvement seems a tad more than incremental.

macleginn|4 months ago

So this looks essentially like continuous prompting (see prefix tuning) with RL-driven selection of what to present as tokens and what as continuous inputs (embeddings).

SknCode|4 months ago

I am not sure I understand things correctly.

I came to believe that LLMs work with token embeddings. Is REFRAG then only "something" in front of the LLM, with the decoder being the RL policy that expands only some chunk embeddings into token embeddings the LLM can consume? Or does REFRAG require you to tune the LLM to work with both token embeddings and chunk embeddings?

armcat|4 months ago

I couldn't immediately see in their graphs/tables any comparison against simple lexical/statistical context compression, such as candidate selection of chunks using TF-IDF, word overlap, etc. For most of us in industry, we need these quick wins that give us performance equivalent to sending a huge amount of information to the LLM while compressing by 10x.

naasking|4 months ago

> the core insight here is actually: if embeddings are generated by layers within the LLM, it makes no sense to convert them back to natural language, just for another LLM to compress those tokens back to embeddings.

Doesn't this tie the two layers together in a way that they can't evolve separately?

asim|4 months ago

This was inevitable. You can't keep training LLMs and expect that's the answer to the evolution of AI. Yes, it'll happen and we'll keep creating new, more refined, and bigger models, but it's like DNA, or something like the cortex of the brain. After that you need these systems that essentially "live" for years digesting information and develop a more refined way to process, store, and retrieve it. Compression of RAG was also inevitable. It's like the btree index of a database.

The thing is, we're probably one or two iterations away from being good enough on the RAG pipeline, and then we'll need to focus more on the other pieces of sensory input that need to be connected and processed at higher throughput. Right now it's not fast or efficient enough.

This is where the likes of Google will shine. They are probably two decades ahead of everyone on internal technology, and there is some team with the breakthrough that just hasn't seen the light of day yet. What's coming out of DeepMind is really a forced effort at productization and publication of work in a consumable format, but internally they are likely way ahead. I don't have as much faith in Meta's efforts, despite seeing things like this. Quite frankly, the people doing the work should move to more honourable companies, not feed crack addiction in the form of Meta's universe.

smeeger|4 months ago

exactly. the real focus internally is working on new architectures. there is no other possibility.

koolala|4 months ago

Did a "superintelligence" lab publish a superintelligence related paper with no results for intelligence? What measured improvements did this proposal make in their LLM's intelligence?

aurohacker|4 months ago

Figure 1 in the paper is all about the encoder and how the context and query is packaged and sent to the decoder. I wish it were more complete...

bigyabai|4 months ago

> Long awaited first paper from Meta Superintelligence Labs is not a model layer innovation. What does this mean?

It means you're reading into it too much and need to be let down, gently, from the hype train.

mikepalmer|4 months ago

I hate articles that don't define their acronyms! Lazy? Intentionally exclusive?

So that others don't also have to look it up, it's Retrieval-Augmented Generation (RAG).

They even say it's "a topic that we didn’t expect"... so... perhaps many people wouldn't have heard of it?

RataNova|4 months ago

Refreshing (and slightly unexpected) to see Meta Superintelligence start with something this practical instead of a headline-grabbing new model

singularity2001|4 months ago

somewhere in my hacker news comment history I presented this very idea

foldl2022|4 months ago

So, show me the model weights, please.

i5heu|4 months ago

Can we please get rid of the clickbait titles?

pppoe|4 months ago

I find it absurd that, compared to the past, large companies now have higher stock prices and more cash than ever before, yet nearly every AI lab at these companies is under more pressure than ever to generate short-term profits. In the midst of AI's unprecedented boom, the research environment and atmosphere in industry seem to have gotten worse.

signatoremo|4 months ago

Is this Meta’s lab pressured to generate short term profits?

Which other under pressure labs are you talking about?

sefrost|4 months ago

Is it because of the "winner takes all" and "lock-in effects" of being the first to market?

cm2012|4 months ago

At first I thought the super intelligence wrote a novel scientific paper

nine_k|4 months ago

A great post, it starts with this:

TL;DR

• MSI’s first paper, REFRAG, is about a new way to do RAG.

• This slightly modified LLM converts most retrieved document chunks into compact, LLM-aligned chunk embeddings that the LLM can consume directly.

• A lightweight policy (trained with RL) decides which chunk embeddings should be expanded back into full tokens under a budget; the LLM runs normally on this mixed input.

• The net effect is far less KV cache and attention cost, much faster first-byte latency and higher throughput, while preserving perplexity and task accuracy in benchmarks.

I wish more long posts followed this model of a scientific paper.
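Reading the bullets literally, the decode-time mixing might look something like the sketch below. All names and shapes here are guesses for illustration, not the paper's API, and the `policy_score` callable stands in for the RL-trained policy:

```python
def build_decoder_input(question_tokens, chunks, policy_score, expand_budget):
    """Mix full-token chunks and compressed chunk embeddings, as in the TL;DR:
    expand only the chunks the policy ranks highest, up to a budget; every
    other chunk is represented by a single chunk-level embedding."""
    ranked = sorted(range(len(chunks)),
                    key=lambda i: policy_score(chunks[i]), reverse=True)
    expand = set(ranked[:expand_budget])
    mixed = []
    for i, chunk in enumerate(chunks):
        if i in expand:
            mixed.extend(chunk["tokens"])      # full token sequence
        else:
            mixed.append(chunk["embedding"])   # one embedding in place of many tokens
    return mixed + question_tokens

chunks = [
    {"tokens": ["the", "sky", "is", "blue"], "embedding": "<emb0>"},
    {"tokens": ["water", "boils", "at", "100C"], "embedding": "<emb1>"},
]
# Toy policy: prefer chunks mentioning "boils"; budget of 1 expansion.
out = build_decoder_input(["why", "?"], chunks, lambda c: "boils" in c["tokens"], 1)
print(out)  # -> ['<emb0>', 'water', 'boils', 'at', '100C', 'why', '?']
```

The KV-cache saving in the bullets falls out directly: the unexpanded chunk costs one position instead of its full token count.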

xvector|4 months ago

Working in big tech it's pretty wild to see how integral AI has become to our work internally, vs the public perception of it. People are NOT prepared.

terminalshort|4 months ago

1. Hyperbolic statement about LLM capabilities with no concrete examples

2. Wild claim that the companies that sell LLMs are actually downplaying their capabilities instead of hyping them

fishmicrowaver|4 months ago

Not prepared for what? Seems like the rest of the world is desperate to be shown the way to unlock something of value?

gdulli|4 months ago

Not everyone has given in to the crutch.