I think one important point is missing here: more data does not automatically lead to better LLMs. If you increase the amount of data tenfold, you might only achieve a slight improvement. We already see that simply adding more parameters, for instance, does not by itself make current models better. Instead, progress is coming from techniques like reasoning, grounding, post-training, and reinforcement learning, which are the main focus of improvement for state-of-the-art models in 2025.
williamtrask|5 months ago
If you get copies of the same data, it doesn't help. In a similar fashion, going from 100 TB of data scraped from the internet to 200 TB of data scraped from the internet... does it tell you much more? Unclear.
But there are large categories of data that aren't represented at all in LLM training sets. Most of the world's data just isn't on the internet. AI for Health is perhaps the most obvious example.
joe_the_user|5 months ago
I have to note that taking the "bitter lesson" position as a claim that more data will result in better LLMs is a wild misinterpretation (or perhaps a "telephone" version) of the original bitter lesson article, which says only that general, scalable algorithms do better than knowledge-carrying, problem-specific algorithms. And the last I heard, it was the "scaling hypothesis" that hardly had consensus among those in the field.
CuriouslyC|5 months ago