The Bitter Lesson Is Misunderstood

[+] kushalc|6 months ago|reply

Hey folks, OOP/original author and 20-year HN lurker here — a friend just told me about this and thought I'd chime in.

Reading through the comments, I think there's one key point that might be getting lost: this isn't really about whether scaling is "dead" (it's not), but rather how we continue to scale for language models at the current LM frontier — 4-8h METR tasks.

Someone commented below about verifiable rewards and IMO that's exactly it: if you can find a way to produce verifiable rewards about a target world, you can essentially produce unlimited amounts of data and (likely) scale past the current bottleneck. Then the question becomes, working backwards from the set of interesting 4-8h METR tasks, what worlds can we make verifiable rewards for and how do we scalably make them? [1]

Which is to say, it's not about more data in general, it's about the specific kind of data (or architecture) we need to break a specific bottleneck. For instance, real-world data is indeed verifiable and will be amazing for robotics, etc. but that frontier is further behind: there are some cool labs building foundational robotics models, but they're maybe ~5 years behind LMs today.

[1] There's another path with better design, e.g. CLIP that improves both architecture and data, but let's leave that aside for now.

[+] FloorEgg|6 months ago|reply

10+ years ago I expected we would get AI that would impact blue collar work long before AI that impacted white collar work. Not sure exactly where I got the impression, but I remember some "rising tide of AI" analogy and graphic that had artists and scientists positioned on the high ground.

Recently it doesn't seem to be playing out as such. The current best LLMs I find marvelously impressive (despite their flaws), and yet... where are all the awesome robots? Why can't I buy a robot that loads my dishwasher for me?

Last year this really started to bug me, and after digging into it with some friends I think we collectively realized something that may be a hint at the answer.

As far as we know, it took roughly 100M-1B years to evolve human level "embodiment" (evolve from single celled organisms to human), but it only took around ~100k-1M for humanity to evolve language, knowledge transfer and abstract reasoning.

So it makes me wonder, is embodiment (advanced robotics) 1000x harder than LLMs from an information processing perspective?

[+] Quarrelsome|6 months ago|reply

> if you can find a way to produce verifiable rewards about a target world

I feel like there's an interesting symmetry here between the pre and post LLM world, where I've always found that organisations over-optimise for things they can measure (e.g. balance sheets) and under-optimise for things they can't (e.g. developer productivity), which explains why its so hard to keep a software product up to date in an average org, as the natural pressure is to run it into the ground until a competitor suddenly displaces it.

So in a post LLM world, we have this gaping hole around things we either lack the data for, or as you say: lack the ability to produce verifiable rewards for. I wonder if similar patterns might play out as a consequence and what unmodelled, unrecorded, real-world things will be entirely ignored (perhaps to great detriment) because we simply lack a decent measure/verifiable-reward for it.

[+] w10-1|6 months ago|reply

> rather how we continue to scale for language models at the current LM frontier — 4-8h METR tasks

I wonder if this doesn't reify a particular business model, of creating a general model and then renting it out Saas-style (possibly adapted to largish customers).

It reminds me of the early excitement over mainframes, how their applications were limited by the rarity of access, and how vigorously those trained in those fine arts defended their superiority. They just couldn't compete with the hordes of smaller competitors getting into every niche.

It may instead be that customer data and use cases are both the most relevant and the most profitable. An AI that could adopt a small user model and track and apply user use cases would have entirely different structure, and would have demonstrable price/performance ratios.

This could mean if Apple or Google actually integrated AI into their devices, they could have a decisive advantage. Or perhaps there's a next generation of web applications that model use-cases and interactions. Indeed, Cursor and other IDE companies might have a leg up if they can drive towards modeling the context instead of just feeding it as intention to the generative LLM.

[+] simne|6 months ago|reply

> if you can find a way to produce verifiable rewards about a target world

I have significant experience on modelling physical world (mostly CFD, but also gamedev - with realistic rigid body collisions and friction).

I admit, exists domain (spectrum of parameters), where CFD and game physics working just well; exists predictable domain (on borders of well working domain), where CFD and game physics working good enough but could show strange things, and exists domain, where you will see lot of bugs.

And, current computing power is so much, that even on small business level (just median gamer desktop), we could save on more than 90% real-world tests with simulations in well working domain (and just avoid use cases in unreliable domains).

So I think, most question is just conservative bosses and investors, who don't believe to engineers and don't understand how to do checks (and tuning) of simulations with real world tests, and what reliable domain is.

[+] olq_plo|6 months ago|reply

Since you seem to know your stuff, why do LLMs need so much data anyway? Humans don't. Why can't we make models aware of their own uncertainty, e.g. feeding the variance of the next token distribution back into the model, as a foundation to guide their own learning. Maybe with that kind of signal, LLMs could develop 'curiosity' and 'rigorousness' and seek out the data that best refines them themselves. Let the AI make and test its own hypotheses, using formal mathematical systems, during training.

[+] mikewarot|6 months ago|reply

My focus lately is on the cost side of this. I believe strongly that it's possible to reduce the cost of compute for LLM type loads by 95% or more. Personally, it's been incredibly hard to get actual numbers for static and dynamic power in ASIC designs to be sure about this.

If I'm right (which I give a 50/50 odds to), and we can reduce the power of LLM computation by 95%, trillions can be saved in power bills, and we can break the need for Nvidia or other specialists, and get back to general purpose computation.

[+] JumpCrisscross|6 months ago|reply

> there are some cool labs building foundational robotics models, but they're maybe ~5 years behind LMs today

Wouldn't the Bitter Lesson be to invest in those models over trying to be clever about ekeing out a little more oomph from today's language models (and langue-based data)?

[+] unknown|6 months ago|reply

[deleted]

[+] amelius|6 months ago|reply

What do you mean by "verifiable rewards"?

Do you mean challenges for which the answer is known?

[+] eab-|6 months ago|reply

What do you mean about CLIP?

[+] godelski|6 months ago|reply

  > this isn't really about whether scaling is "dead"

I think there's a good position paper by Sara Hooker[0] that mentions some of this. Key point being that while the frontier is being pushed by big models with big data there's a very quiet revolution of models using far fewer parameters (still quite big) and data. Maybe "Scale Is All You Need"[1], but that doesn't mean it is practical or even a good approach. It's a shame these research paths have gotten a lot of pushback, especially given today's concerns about inference costs (this pushback still doesn't seem to be decreasing)

  > verifiable rewards

There's also a current conversation in the community over world models: is it actually a world model if the model does not recover /a physics/[2]. The argument for why they should recover a physics is that this means a counterfactual model must have been learned (no guarantees on if it is computationally irreducible). A counterfactual model gives far greater opportunities for robust generalization. In fact, you could even argue that the study of physics is the study of compression. In a sense, physics is the study of the computability of our universe[3]. Physics is counterfactual, allowing you to answer counterfactual questions like "What would the force have been if the mass had been 10x greater?" If this were not counterfactual we'd require different algorithms for different cases.

I'm in the recovery camp. Honestly I haven't heard a strong argument against it. Mostly "we just care that things work" which, frankly, isn't that the primary concern of all of us? I'm all for throwing shit at a wall and seeing what sticks, it can be a really efficient method sometimes (especially in early exploratory phases), but I doubt it is the most efficient way forward.

In my experience, having been a person who's created models that require magnitudes fewer resources for equivalent performance, I cannot stress enough the importance of quality over quantity. The tricky part is defining that quality.

[0] https://arxiv.org/abs/2407.05694

[1] Personally, I'm unconvinced. Despite success of our LLMs it's difficult to decouple other variables.

[2] The "a" is important here. There's not one physics per-say. There are different models. This is a level of metaphysics most people will not encounter and has many subtleties.

[3] I must stress that there's a huge difference between the universe being computable and the universe being a computation. The universe being computable does not mean we all live in a simulation.

[+] mdemare|6 months ago|reply

Just using common sense, if we had a genius, who had tremendous reasoning ability, total recall of memories, and an unlimited lifespan and patience, and he'd read what the current LLMs have read, we'd expect quite a bit more from him than what we're getting now from LLMs.

There are teenagers that win gold medals on the math olympiad - they've trained on < 1M tokens of math texts, never mind the 70T tokens that GPT5 appears to be trained on. A difference of eight orders of magnitude.

In other words, data scarcity is not a fundamental problem, just a problem for the current paradigm.

[+] FloorEgg|6 months ago|reply

The problem I am facing in my domain is that all of the data is human generated and riddled with human errors. I am not talking about typos in phone numbers, but rather fundamental errors in critical thinking, reasoning, semantic and pragmatic oversights, etc. all in long-form unstructured text. It's very much an LLM-domain problem, but converging on the existing data is like trying to converge on noise.

The opportunity in the market is the gap between what people have been doing and what they are trying to do, and I have developed very specialized approaches to narrow this gap in my niche, and so far customers are loving it.

I seriously doubt that the gap could ever be closed by throwing more data and compute at it. I imagine though that the outputs of my approach could be used to train a base model to close the gap at a lower unit cost, but I am skeptical that it would be economically worth while anytime soon.

[+] cs702|6 months ago|reply

I don't think Sutton's essay is misunderstood, but I agree with the OP's conclusion:

We're reaching scaling limits with transformers. The number of parameters in our largest transformers, N, is now in the order of trillions, which is the most we can apply given the total number of tokens of training data available worldwide, D, also in the order of trillions, resulting in a compute budget C = 6N × D, which is in the order of D². OpenAI and Google were the first to show these transformer "scaling laws." We cannot add more compute to a given compute budget C without increasing data D to maintain the relationship. As the OP puts it, if we want to increase the number of GPUs by 2x, we must also increase the number of parameters and training tokens by 1.41x, but... we've already run out of training tokens.

We must either (1) discover new architectures with different scaling laws, and/or (2) compute new synthetic data that can contribute to learning (akin to dreams).

[+] antirez|6 months ago|reply

This is true for the pre-training step. What if advancements in the reinforcement learning steps performed later may benefit from more compute and more models parameters? If right now the RL steps only help with sampling, that is, they only optimize the output of a given possible reply instead of the other (there are papers pointing at this: that if you generate many replies with just the common sampling methods, and you can verify correctness of the reply, then you discover that what RL helps with is selecting what was already potentially within the model output) this would be futile. But maybe advancements in the RL will do to LLMs what AlphaZero-alike models did with Chess/Go.

[+] charleshn|6 months ago|reply

> We cannot add more compute to a given compute budget C without increasing data D to maintain the relationship. > We must either (1) discover new architectures with different scaling laws, and/or (2) compute new synthetic data that can contribute to learning (akin to dreams).

Of course we can, this is a non issue.

See e.g. AlphaZero [0] that's 8 years old at this point, and any modern RL training using synthetic data, e.g. DeepSeek-R1-Zero [1].

[0] https://en.m.wikipedia.org/wiki/AlphaZero

[1] https://arxiv.org/abs/2501.12948

[+] FloorEgg|6 months ago|reply

What about or (3) models that interact with the real world?

To be clear I also agree with your (1) and (2).

[+] dosnem|6 months ago|reply

This seems so simple but I’m totally not understanding it..

If C = D^2, and you double compute, then 2C ==> 2D^2. How do you and the original author get 1.41D from 2D^2?

[+] aerospades|6 months ago|reply

I disagree with the author's thesis about data scarcity. There's an infinite amount of data available in the real world. The real world is how all generally intelligent humans have been trained. Currently, LLMs have just been trained on the derived shadows (as in Plato's allegory of the cave). The grounding to base reality seems like an important missing piece. The other data type missing is the feedback: more than passively training/consuming text (and images/video), being able to push on the chair and have it push back. Once the AI can more directly and recursively train on the real world, my guess is we'll see Sutton's bitter lesson proven out once again.

[+] credit_guy|6 months ago|reply

> There is no second internet

I don't know about that. LLMs have been trained mostly on text. If you add photos, audio and videos, and later even 3D games, or 3D videos, you get massively more data than the old plain text. Maybe by many orders of magnitude. And this is certainly that can improve cognition in general. Getting to AGI without audio and video, and 3D perception seems like a non-starter. And even if we think AGI is not the goal, further improvements from these new training datasets are certainly conceivable.

[+] nightsd01|6 months ago|reply

I am not an expert in AI by any means but I think I know enough about it to comment on one thing: there was an interesting paper not too long ago that showed if you train a randomly-initialized model from scratch on questions, like a bank of physics questions & answers, models will end up with much higher quality if you teach it the simple physics questions first, and then move up to more complex physics questions. This shows that in some ways, these large language models really do learn like we do.

I think the next steps will be more along this vain of thinking. Treating all training data the same is a mistake. Some data is significantly more valuable to developing an intelligent model than most other training data, even when you pass quality filters. I think we need to revisit how we 'train' these models in the first place, and come up with a more intelligent/interactive system of doing so

[+] a2128|6 months ago|reply

From my personal experience training models this is only true when the parameter count is a limiting factor. When the model is past a certain size, it doesn't really lead to much improvement to use curriculum learning. I believe most research also applies it only to small models (e.g. Phi)

[+] FloorEgg|6 months ago|reply

Wow. I really like this take. I've seen how time and time again nature follows the Pareto principle. It makes sense that training data would follow this principle as well.

Further that the order of training matters is novel to me and seems so obvious in hindsight.

Maybe both of these points are common knowledge/practice among current leading LLM builders. I don't build LLMs, I build on and with them, so I don't know.

[+] txrx0000|6 months ago|reply

This is precisely why chain of thought worked. Written thoughts in plain English is a much higher SNR encoding of the human brain's inner workings than random pages scraped from Amazon. We just want the model to recover the brain, not Amazon's frontend web framework.

[+] flux3125|6 months ago|reply

Curriculum learning: https://en.wikipedia.org/wiki/Curriculum_learning

[+] nikki93|6 months ago|reply

A relevant paper: https://arxiv.org/abs/2306.11644 -- the Phi models (and many others too) are based on this idea.

[+] simianwords|6 months ago|reply

I have never heard of order of training data matter in back propagation

[+] NooneAtAll3|6 months ago|reply

while I don't disagree with the facts, I don't understand the... tone?

when Dennard scaling (single core performance) started to fail in 90s-00s, I don't think there was a sentiment "how stupid was it to believe such a scaling at all"?

sure, people were compliant (and we still meme about running Crysis), but in the end the discussion resulted in "no more free lunch" - progress in one direction has hit a bottleneck, so it's time to choose some other direction to improve on (and multi-threading has now become mostly the norm)

I don't really see much of a difference?

[+] lawrencechen|6 months ago|reply

In the bitter lesson essay [0], the word "data" is not mentioned a single time.

The author fundamentally misunderstands the bitter lesson.

[0] https://www.cs.utexas.edu/~eunsol/courses/data/bitter_lesson...

[+] geetee|6 months ago|reply

I don't understand why we need more data for training. Assuming we've already digitized every book, magazine, research paper, newspaper, and other forms of media, why do we need this "second internet?" Legal issues aside, don't we already have the totality of human knowledge available to us for training?

[+] decimalenough|6 months ago|reply

The goal/theory behind the LLM investment explosion is that we can get to AGI by feeding them all the data. And to be clear, by AGI I don't mean "superhuman singularity", just "intelligent enough to replace most humans" (and, by extension, hoover up all the money we're spending on their salaries today).

But if we've already fed them all the data, and we don't have AGI (which we manifestly don't), then there's no way to get to AGI with LLMs and the tech/VC industry is about to have a massive, massive problem justifying all this investment.

[+] hansvm|6 months ago|reply

We don't have anything close to the totality of human knowledge digitized, much less in a form that LLMs can easily take advantage of. Even for easily verifiable facts powering modern industry, details like appropriate lube/speeds/etc for machining molybdenum for this or that purpose just don't exist outside of the minds of the few people who actually do it. Moreover, _most_ knowledge is similarly locked up inside a few people rather than being written down.

Even when written down, without the ability to interact with and probe the world like you did growing up it's not possible to meaningfully tell the difference between 9/11 hoaxers and everyone else save for how frequent the relative texts appear. They don't have the ability to meaningfully challenge their world model, and that makes the current breadth of written content even less useful than it might otherwise appear.

[+] dr_dshiv|6 months ago|reply

Let’s keep in mind that we don’t have most of the renaissance through the early modern period (1400-1800) because it was published in neolatin with older typefaces— and only about 10% is even digitized.

We probably don’t have most of the Arabic corpus either — and barely any Sanskrit. Classical Chinese is probably also lacking — only about 1% of it is translated to English.

[+] kbenson|6 months ago|reply

I interpreted it as a roundabout way of increasing quality. Take any given subreddit. You have posts and comments, and scores, but what if the data quality isn't very good overall? What if instead of using it as is, you instead had an AI evaluate and reason about all the posts, and classify them itself based on how useful the posts and comments are, how well they work out in practice (if easily simulated), etc? Essentially you're using the AI to provide a moderated and carefully curated set of information about the information that was already present. If you then ingest this information, does that increase the quality of the data? Probably(?), since you're throwing compute and AI reasoning at the problem ahead of time reducing compute and lowering the low quality data by adding additional high quality data.

[+] brazzy|6 months ago|reply

The point is that current methods are unable to get more than the current state-of-the-art models' degree of intelligence out of training on the totality of human knowledge. Previously, the amount of compute needed to process that much data was a limit, but not anymore.

So now, in order to progress further, we either have to improve the methods, or synthetically generate more training data, or both.

[+] unknown|6 months ago|reply

[deleted]

[+] j7ake|6 months ago|reply

The totality of human knowledge is a rounding error to what’s needed for AGI

[+] incompatible|6 months ago|reply

A lot of newspapers seem to be stuck behind paywalls, even when in the public domain.

[+] casey2|6 months ago|reply

It's a boot-strapping problem. LLMs have shown that we can reproduce data that's already in the form we want, and use that data to solve novel problems. There is no shortage of data, it's just data that's in a form you want is hard to come by. You want to create a model that generates steps for a robot with a particular shape? First you have to create a robot with that shape that can walk, then create a million of them and record them walking all over the place. Now you have something that's probably going to be too slow to run. Not fesible in the real world, the closest we have today is something like driverless car, (which is already a solved problem they are called trains)

This is why I think China will ultimately win the AI race, they will be able to put tens of millions of people to a specific task until there is enough data generated to replace humans on that task in 99.99% of cases, and they have the manufacturing capability to make the millions of IO devices needed for this.

Yes, humanoid robots are a good idea, but only if you can train them with walking data from real people, I think it will probably translate well enough to most humanoid robots, but ideally you are designing the physical robot from the ground up to model human movement as close as possible. You have to accept that if we go the LM route for AI that the optimal hardware behaves like human wetware. The neuromorphic computing people get it, robotics people should too.

[+] sfpotter|6 months ago|reply

Oh man, I love crazy stuff like this on HN. For a community which espouses rationality and careful thought, somehow an article with "C ~ D^2" has floated to the top. No notes.

[+] frankenstine|6 months ago|reply

> The path forward: data alchemists (high-variance, 300% lottery ticket) or model architects (20-30% steady gains)

No, the paths forward are: better design, training, feeding in more video, audio, and general data from the outside world. The web is just a small part of our experience. What about apps, webcam streams, radio from all over the world in its many forms, OTA TV, interacting with streaming content via remote, playing every video game, playing board games with humans, feeds and data from robots LLMs control, watching everyone via their phones and computers, car cameras, security footage and CCTV, live weather and atmospheric data, cable television, stereoscopic data, ViewMaster reels, realtime electrical input from various types of brains while interacting with their attached creatures, touch and smell, understanding birth, growth, disease, death, and all facets of life as an observer, observing those as a subject, expanding to other worlds, solar systems, galaxies, etc., affecting time and space, search and communication with a universal creator, and finally understanding birth and death of the universe.

[+] TheDong|6 months ago|reply

Is the data input into ChatGPT not a large enough source of new data to matter?

People are constantly inputting novel data, telling ChatGPT about mistakes it made and suggesting approaches to try, and so on.

For local tools, like claude code, it feels like there's an even bigger goldmine of data in that you can have a user ask claude code to do something, and when it fails they do it themselves... and then if only anthropic could slurp up the human-produced correct solution, that would be high quality training data.

I know paid claude-code doesn't slurp up local code, and my impression is paid ChatGPT also doesn't use input for training... but perhaps that's the next thing to compromise on in the quest for more data.

[+] madrox|6 months ago|reply

In any field where there is a creative element, progress comes in fits and starts that are difficult to predict in advance. No one can accurately predict when we'll get the cure for cancer, for example, in spite of people working on it.

But that isn't how investors operate. They want to know what they will get in exchange for giving a company a billion dollars. If you're running an AI business, you need to set expectations. How do you do that? Go do the thing you know you can do on a schedule, like standing up a new GPU data center.

I don't think the bitter lesson is misunderstood in quite the way the author describes. I think most are well aware we're approaching the data wall within a couple years. However, if you're not in academia you're not trying to solve that problem; you're trying to get your bag before it happens.

That may sound a little flip, but this is yet another incarnation of the hungry beast: https://stvp.stanford.edu/clips/the-hungry-beast-and-the-ugl...

[+] simianwords|6 months ago|reply

Why do you assume investors don’t know about this? They know some investments follow the power law - very few of them work out but they bring most value.

The very existence of openAI and Anthropic are proof of it happening.

Imagine you were an investor and you know what you know now (creativity can’t be predicted). How would you then invest in companies? Your answer might converge on existing VC strategies.

[+] d--b|6 months ago|reply

How does the brain do it?

A baby's brain isn't wired to the entire internet. A 2-year-old has access to at most 2 years of HD video data, plus some other belly-ache and poo-smell stimuli. And a baby's brain has no replay capacity.

That's not a lot to work with.

Yet, a 2-year-old clearly thinks, is conscious, can understand and create sentences, and wants to annihilate everything just as much as Grok.

Sure you can scale data all you want. But there should be enough to work with without scaling like crazy.

Having AI know all CSS tricks out there is one thing that requires a lot of data, AGI is different.

[+] benlivengood|6 months ago|reply

I don't think anyone has yet trained on all videos on the Internet. Plenty of petabytes left there to pretrain on, and likely just as useful once the text/audio/image pretraining is done.

[+] nobodywillobsrv|6 months ago|reply

They mention symmetries and invariances and what not but I wonder if it would be better to clearly emphasize that, in some problems, when you remove certain kinds of symmetries you are are combinatorially worse off. And this is ridiculously bad in some settings. Learning symmetries automatically used to be something I would see people working on but haven't kept up lately.

[+] TheDudeMan|6 months ago|reply

I interpret The Bitter Lesson as suggesting that you should be selecting methods that do not need all that data (in many domains, we don't know those methods yet).

[+] nahuel0x|6 months ago|reply

Also to consider, how the massive datasets powering LLMs were generated? For the case of text, it was generations of humans and humans lives, experiences and interactions with the real world that coagulated into masses of text and the language itself.. not to mention the evolutionary process that made that possible. There is an history of biological computation and interaction behind what it seems to be static data.

251 comments