I remember reading that LLMs have consumed the internet's text data; I seem to recall there is an open dataset for that too. Other potential sources of data would be images (probably already consumed) and videos: YouTube must hold an enormous amount of data to consume, and perhaps Facebook or Instagram private content. But even with all of these it does not feel like AGI. That sounds like the "fusion reactors are 20 years away" argument, except this is supposedly coming in 2 years, and they have not even got the core technology of how to build AGI.
Scrounger|6 months ago
Not just the internet text data, but most major LLM models have been trained on millions of pirated books via Libgen:
https://techcrunch.com/2025/01/09/mark-zuckerberg-gave-metas...