I remember reading that LLMs have consumed the internet's text data; I seem to recall there is an open dataset for that too. Other potential sources of data would be images (probably already consumed) and videos: YouTube must hold an enormous amount of data to consume, and perhaps Facebook or Instagram private content. But even with all of these it does not feel like AGI. That sounds like the "fusion reactors are 20 years away" argument, except this is supposedly coming in 2 years, and they have not even got the core technology of how to build AGI.
Scrounger|6 months ago
Not just the internet text data, but most major LLM models have been trained on millions of pirated books via Libgen:
https://techcrunch.com/2025/01/09/mark-zuckerberg-gave-metas...