top | item 41297798

'Model collapse'? An expert explains the rumours about an impending AI doom

46 points | ColinWright | 1 year ago | theconversation.com

64 comments

[+] KolmogorovComp|1 year ago|reply
> For instance, researchers found a 16% drop in activity on the coding website StackOverflow one year after the release of ChatGPT. This suggests AI assistance may already be reducing person-to-person interactions in some online communities.

I don't get why this is bad news. It probably means these questions would have been duplicates and/or very easy ones that ChatGPT could answer.

My issue with SO is at the opposite end of the spectrum. By the time I am ready to post a question, after having done my homework, I know it is either an unfixable bug or out of reach for most if not all users, and the probability of getting an answer is very low.

Most of the time I then answer my own questions a few months later.

[+] creesch|1 year ago|reply
The article actually does explain this in the first few paragraphs.

It is an issue because communities like StackOverflow are one of the sources of training material. LLMs are not capable of innovation; they can only produce variations on what is in their training material. People asking questions of other people allows for genuinely new and fresh answers, where someone might come up with a solution nobody else has yet.

And yes, you are right, some of these questions would be duplicates and easy to answer. But only up to a point: there are a lot of answers on StackOverflow that are technically correct but severely outdated by now. They have been superseded by more modern ways of doing things, or the APIs involved have changed in newer versions of frameworks.

With a human community you will have people leaving comments on older answers pointing this out, and answering new questions with the current approach. LLMs, again, are not capable of picking up this new information on their own. You can often see this in the answers you get from them: they are frequently ever so slightly dated already.

When the communities that feed them die, they will have less training material, so at the very least they will stagnate.

[+] clausecker|1 year ago|reply
> Most of the time I then answer my own questions a few months later.

Thank you for doing that. This is very valuable.

[+] lifeisstillgood|1 year ago|reply
Agreed - those who get the most out of SO seem to use it as a personal blogging tool - answering questions as a trigger for writing out what they know.

I have always assumed that a good marketing ruse for FOSS would be to pay an intern to basically Q&A your entire docs into SO - if you do it right it would even build a better roadmap for your project.

[+] ajsnigrutin|1 year ago|reply
> By the time I am ready to post a question after having done my homework, I know it is either an unfixable bug or out of reach for most if not all users, and the probability of getting an answer is very low.

And even then it gets marked as duplicate of something unrelated.

[+] ColinWright|1 year ago|reply
People flooding the 'net with LLM-generated crap are both eating their seed corn and poisoning the well.

Steel manufactured before 1945[0] can be incredibly valuable[1] because it hasn't been tainted by fallout from nuclear tests and bombs, and maybe fairly soon any archived internet material written pre-2010 will be considered equally valuable.

[0] https://en.wikipedia.org/wiki/Low-background_steel

[1] https://interestingengineering.com/science/what-is-pre-war-s...

[+] Y_Y|1 year ago|reply
Not to mention pissing in the pool and tragedying the commons!

(By the way, the radioactivity in the atmosphere that was poisoning the steel has pretty much vanished by now, so don't go investing everything in battleship reclamation in 2024.)

[+] madaxe_again|1 year ago|reply
I dunno - this seems like a problem ideally suited for adversarial networks to resolve. All you need is a smaller, dumber, validated model that can check the output of the bigger, smarter model, and either use it directly as a filter between the user and the larger model, or use it to retrain the larger model until it isn't dumb and wrong.
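A minimal sketch of that filter arrangement, with toy stand-in functions rather than real models (all names here are made up for illustration):

```python
# Toy sketch: a small, trusted validator gates the output of a larger,
# untrusted generator. Both "models" below are stubs, not real networks.

def big_model(prompt: str, attempt: int) -> str:
    # Pretend the large model is wrong on its first try.
    canned = {"2+2": ["5", "4"]}
    return canned[prompt][min(attempt, 1)]

def small_validator(prompt: str, answer: str) -> bool:
    # A dumb but verified checker, e.g. exact arithmetic.
    return prompt == "2+2" and answer == "4"

def filtered_answer(prompt: str, max_attempts: int = 3):
    # Retry the big model until the small one approves, or give up.
    for attempt in range(max_attempts):
        answer = big_model(prompt, attempt)
        if small_validator(prompt, answer):
            return answer
    return None  # refuse rather than pass junk through
```

The same approval signal could in principle drive retraining instead of filtering, which is the second option described above.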
[+] seydor|1 year ago|reply
I guess people thought the same when the photocopier was invented. Unlike steel, language is a human product.
[+] amelius|1 year ago|reply
Unicode should make a special character for "start AI-generated content".

Then we can hopefully filter out the crap before training new models.
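No such character exists, but filtering on one would be trivial; here is a toy sketch that borrows a private-use codepoint (U+E000) as a stand-in for the imagined marker:

```python
# Hypothetical: a reserved codepoint marking the start of AI-generated
# content. U+E000 is a Unicode private-use character borrowed here as a
# stand-in, since no real "AI content" character has been assigned.
AI_MARK = "\ue000"

def is_human_written(document: str) -> bool:
    # A single scan per document is all the scrubbing would take.
    return AI_MARK not in document

corpus = [
    "a human-written answer",
    AI_MARK + "a generated answer",
    "another human-written one",
]
clean = [doc for doc in corpus if is_human_written(doc)]
# clean keeps only the two unmarked documents
```

The hard part, of course, is getting generators to emit the marker honestly, not the filtering itself.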

[+] JKCalhoun|1 year ago|reply
> To train GPT-3, OpenAI needed over 650 billion English words of text – about 200x more than the entire English Wikipedia.

Since, I assume, humans don't need this much training (?), this area would seem ripe to explore — can you achieve similar results with a fraction of the data needed for GPT-3?

[+] mattnewton|1 year ago|reply
It's an apples-to-oranges comparison: humans are getting massive amounts of data from their nervous systems, far richer than the token stream of Wikipedia. The model has to learn everything about the world solely from statistical patterns in text, whereas humans are seeing, hearing and touching the world in full resolution. Humans also take an active role in seeking out novel stimuli they can learn from.

But I think that does shed light on why multi-modality is so promising: beyond additional use cases, it could make the models better at existing text tasks too when they train on all that video data.

And I think it also reinforces why everyone is so hyped for reinforcement-learning techniques, like the self-play that AlphaZero used, coming to LLMs.

[+] antklan|1 year ago|reply
This can be replicated on a small scale by using two LLMs (which can be two instances of the same LLM). Start with a human prompt, then feed the answer of LLM-1 as the prompt to LLM-2, then feed that answer of LLM-2 to LLM-1 and so forth.

The answers soon converge to some boring, bland repetition that isn't even logical.
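The ping-pong is easy to sketch; even a toy "model" that paraphrases lossily shows the same collapse to a fixed point (the function below is a stub, not a real LLM):

```python
def toy_llm(prompt: str) -> str:
    # Lossy paraphrase: keep only the first few words, pad with filler.
    # Stands in for an LLM that preserves less content each round.
    return " ".join(prompt.split()[:3]) + " ... (and so on)"

def ping_pong(seed: str, rounds: int = 6):
    # Feed each answer back as the next prompt, alternating LLM-1/LLM-2
    # (here both are the same toy function).
    msg, history = seed, []
    for _ in range(rounds):
        msg = toy_llm(msg)
        history.append(msg)
    return history

hist = ping_pong("Explain why the sky is blue in terms of Rayleigh scattering")
# After the first round the messages stop changing at all.
```

Real LLMs take more rounds and collapse less literally, but the direction of travel is the same: information is lost at every hop and never recovered.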

No intelligence here, which means that code-emitting LLMs are just stealing human IP that they happened to have read.

[+] viraptor|1 year ago|reply
What do you think would happen if you did that with people who have no reason to talk, but are forced to give an answer each time? I'm sure at some point they'd just degrade to nonsense, but I don't think that says anything about intelligence. (Being stuck with a friend for a long time does occasionally degrade into nonsense and absurd in-jokes, in my experience.)
[+] JKCalhoun|1 year ago|reply
I too was trained by stealing human "IP". It just took decades (and perhaps is continuing still). Maybe slow, ongoing AI is the future.
[+] ndsipa_pomu|1 year ago|reply
> No intelligence here, which means that code emitting LLMs are just stealing human IP that they happened to have read.

Well, "stealing" is incorrect, as they don't consume the original text - it's still there. "Plagiarising" would be more accurate.

[+] whoitwas|1 year ago|reply
I wouldn't call crawling public data stealing. LLMs are just lazy and bad like the people who implemented them.
[+] DrScientist|1 year ago|reply
In the human space there is also 'model collapse', as demonstrated by group-think, or by the dark arts of marketing/PR swaying opinion at large.

This is why the scientific method was developed.

If you want an AI that truly learns then it has to get up off the sofa and test its ideas in the real world, rather than learning to parrot hearsay.

[+] tho234i2342334|1 year ago|reply
The "scientific method" doesn't fix anything, any more than praying to Isis does.

It's not the way science is done anyway.

[+] openrisk|1 year ago|reply
We can speculate all day long how the current "AI" phenomenon will evolve but, alas, there isn't much solid on which to ground arguments.

In the oral phase of human communication, "AI models" resided within human brains: the origin of any stream of messages was literally in front of you, in flesh and blood.

In the print phase we have the first major decoupling. Whether in the form of a cuneiform tablet or a modern paperback, provenance was no longer assured. We had to rely on "controlled" seals, trust the publishers and their distribution chains, etc. Ultimately this worked because the relative difficulty of producing printed artifacts helped develop a legal/political apparatus to control the spread of the "fake" stuff.

Enter the digital communications era and we have the second major decoupling. The volume (and increasingly the apparent veracity) of human-oriented messaging that can be generated is no longer a limiting factor. This has no precedent. You can now plausibly create a fake Wikipedia [1] just by running a model. The signal-to-noise ratio of digitally exchanged messages can experience catastrophic collapse.

> A flood of synthetic content might not pose an existential threat to the progress of AI development, but it does threaten the digital public good of the (human) internet.

Indeed, this is the real risk. I don't care about idiot "AI" feeding on itself but I do care about destroying any basis of sane digital communication between humans.

Will we develop the legal/political apparatus to control the flood of algorithmic junk? Will it be remotely democratic? The best precedent of digital automation destroying communication channels and our apparent inability and/or unwillingness to do something about it is email spam [2]. Decades after it first appeared spam is still degrading our infosphere. The writing is on the wall.

[1] https://en.wikipedia.org/wiki/Wikipedia:List_of_hoaxes_on_Wi...

[2] https://en.wikipedia.org/wiki/Email_spam

[+] botro|1 year ago|reply
In the model testing I've conducted, I've seen that LLMs from competing companies, including GPT-4o, Gemini Flash 1.5, Llama 3.1 and Phi-3, all converge on essentially the same joke. For a test of creativity this was alarming: they all tell slight variations of the same joke about ladders.

I've posted about it here: https://news.ycombinator.com/item?id=41125309

[+] viraptor|1 year ago|reply
This could use a bit more nuance around training on AI output. While naive approaches do degrade the replies, there are many documented cases where the quality improves instead. STaR, self-reflection, groups of agents, and likely others I don't know about all improve the results using only the same model's output.
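Self-reflection, for instance, is just a draft/critique/revise loop over the same model. A toy sketch with a stubbed model call (the control flow is the point here, not the canned stub responses):

```python
def model(prompt: str) -> str:
    # Stub standing in for a single LLM call. Real self-reflection uses
    # one actual model for all three roles.
    if prompt.startswith("CRITIQUE:"):
        return "The draft starts at 1 but the task asked for 0..9."
    if prompt.startswith("REVISE:"):
        return "for i in range(10): print(i)"
    return "for i in range(1, 10): print(i)"  # flawed first draft

def self_reflect(task: str) -> str:
    # Draft, critique the draft, then revise using the critique.
    draft = model(task)
    critique = model("CRITIQUE: " + task + "\n" + draft)
    return model("REVISE: " + task + "\n" + draft + "\n" + critique)
```

The model's own output (the critique) is exactly the kind of self-generated training signal that improves rather than degrades quality.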
[+] imtringued|1 year ago|reply
We don't need "more" data. We already have all the data we need in terms of quantity. We don't need more "synthetic" data for supervised learning. We need better training algorithms that go beyond minimizing token level loss.
[+] mathw|1 year ago|reply
Surely the only way you overcome this is to build a system which can actually generate something new, rather than regurgitating an incredibly complicated statistical reworking of what it was trained on.

I did note how the author said that training models on other models doesn't have the ethical implications of training on stolen human data - except it does, because where did the first model get its training set from? This is why we make it illegal not just to steal but also to handle stolen goods.

[+] lifeisstillgood|1 year ago|reply
>>> 98% of collected data was rejected.

I mean, apart from wow! 98% of the internet is shit (that's higher than even I assumed) - how did they differentiate between good and shit? PageRank? Length? I can tell the difference between good writing and bad writing (and porn and erotica), but I have to read it using my brain-based LLM - how did they do it?

This leads to the whole open-source "show us your training data" thing.
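My guess is it's mostly cheap rule-based heuristics rather than anything brain-like, in the spirit of the published C4/Gopher-style cleaning rules. A sketch with made-up thresholds (not anyone's actual pipeline):

```python
def keep(text: str) -> bool:
    # Cheap document-quality heuristics: length, symbol ratio,
    # repetition. Thresholds here are illustrative guesses.
    words = text.split()
    if len(words) < 5:                                   # too short to be prose
        return False
    if sum(c.isalpha() for c in text) / max(len(text), 1) < 0.6:
        return False                                     # mostly symbols/markup
    if len(set(words)) / len(words) < 0.3:               # heavy repetition
        return False
    return True

docs = [
    "click here",
    "$$$ WIN WIN WIN WIN WIN WIN WIN WIN WIN WIN $$$",
    "A reasonably ordinary paragraph of English text survives the filters.",
]
kept = [d for d in docs if keep(d)]
```

Rules like these need no "brain-based LLM" and run in linear time over the corpus, which is presumably why this style of filtering scales to web crawls.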

[+] Mashimo|1 year ago|reply
The article says 90% of collected data, not 98% of the entire internet.

> discard as much as 90% of the data they initially collect for training models.