Some perspectives from someone working in the image space.
These tests don't feel practical - That is, they seem intended to collapse the model, not demonstrate "in the wild" performance.
The assumption is that all content is black or white - AI or not AI - and that you treat all content as equally worth retraining on.
It offers no room for assumptions around data augmentation, human-guided quality discrimination, or anything else that might alter the set of outputs to mitigate the "poison".
As someone also working in the imaging space, I find AI-generated data useful so long as it's used carefully.
Specifically, we're implementing AI-culled training sets which contain some generated data that then gets reviewed manually for a few specific things, then pushed into our normal training workflows. This makes for a huge speedup versus 100% manual culling, and the metrics don't lie: the models continue to improve steadily.
There may be a point where they're poisoned and will collapse, but I haven't seen it yet.
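A minimal sketch of what that kind of pipeline might look like (the scorer, review step, and threshold below are hypothetical stand-ins, not the commenter's actual tooling):

```python
# Hypothetical sketch of an AI-assisted culling pipeline: generated samples
# are pre-scored automatically, only the top fraction goes to manual review,
# and approved samples join the normal training workflow.
from typing import Callable, Iterable, List

def cull_generated_samples(
    samples: Iterable[str],
    quality_score: Callable[[str], float],   # stand-in automatic scorer
    manual_review: Callable[[str], bool],     # stand-in human review step
    keep_fraction: float = 0.2,
) -> List[str]:
    """Keep only generated samples that pass both the automatic pre-cull
    and human review, so retraining never sees raw generator output."""
    ranked = sorted(samples, key=quality_score, reverse=True)
    shortlist = ranked[: max(1, int(len(ranked) * keep_fraction))]
    return [s for s in shortlist if manual_review(s)]

# Approved samples are then mixed into the usual human-curated set:
# train_set = human_data + cull_generated_samples(generated, scorer, reviewer)
```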
This is exactly right. Model collapse does not exist in practice. In fact, LLMs trained on newer web scrapes have increased capabilities thanks to the generated output in their training data.
For example, "base" pretrained models trained on scrapes which include generated outputs can 0-shot instruction follow and score higher on reasoning benchmarks.
Intentionally produced synthetic training data takes this a step further. For SoTA LLMs the majority of, or all of, their training data is generated. Phi-2 and Claude 3 for example.
This is the part that I don't really understand. Isn't this basically an evolutionary algorithm, where the fitness function is "whatever people like the most" (or at least enough to post it online)?
People rarely generate 10 pieces of content with AI and then share all 10 with the world. They usually only share the best ones. This naturally filters for better output.
Are they saying that evolutionary algorithms don't work?
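To make the implied selection pressure concrete, here is a toy sketch; the "quality" field is a hypothetical proxy for "what people like enough to post", not a real metric:

```python
import random

def generate(prompt: str) -> dict:
    # Stand-in for a generator call; "quality" is a hypothetical proxy for
    # how much people would like the result.
    return {"prompt": prompt, "content": f"output for {prompt!r}", "quality": random.random()}

def publish_best(prompt: str, n: int = 10, keep: int = 1) -> list:
    """Generate n candidates but only 'publish' the top `keep`, mimicking the
    filter the comment describes: the internet mostly sees the winners."""
    candidates = [generate(prompt) for _ in range(n)]
    return sorted(candidates, key=lambda c: c["quality"], reverse=True)[:keep]

print(publish_best("a watercolor fox"))
```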
> Use the model to generate some AI output. Then use that output to train a new instance of the model and use the resulting output to train a third version, and so forth. With each iteration, errors build atop one another. The 10th model, prompted to write about historical English architecture, spews out gibberish about jackrabbits.
That this happens doesn't surprise me, but I'd love to see a curve of how each organic vs. machine content mix ratio results in model collapse over N generations.
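A toy way to eyeball that curve (my own illustrative sketch, not the paper's experiment): repeatedly fit a 1-D Gaussian, resample from the fit, and refit, with `machine_frac` controlling how much of each generation's data comes from the previous model versus fresh draws from the original distribution.

```python
import numpy as np

def simulate(machine_frac: float, generations: int = 30, n: int = 50, seed: int = 0):
    """Track the fitted standard deviation across generations of retraining
    on a mix of fresh 'organic' data and samples from the previous fit."""
    rng = np.random.default_rng(seed)
    mu, sigma = 0.0, 1.0                                   # true distribution
    stds = []
    for _ in range(generations):
        n_machine = int(n * machine_frac)
        synthetic = rng.normal(mu, sigma, n_machine)       # from current model
        organic = rng.normal(0.0, 1.0, n - n_machine)      # fresh real data
        data = np.concatenate([synthetic, organic])
        mu, sigma = data.mean(), data.std()                # "retrain"
        stds.append(sigma)
    return stds

for frac in (0.0, 0.5, 0.9, 1.0):
    s = simulate(frac)
    print(f"machine_frac={frac}: std at gen 1/15/30 = {s[0]:.2f} / {s[14]:.2f} / {s[-1]:.2f}")
```

With all-synthetic data the fitted spread tends to drift (usually shrinking, losing the tails), while even a modest fraction of fresh organic data keeps it anchored near 1.0; real models are far more complicated, but the mixing-ratio question is the same.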
I believe that this is a non-problem pushed forward by small-scale experiments that are not representative of what people actually do with AI generation.
A lot of new content, while AI generated, has been hand-picked and polished by a human (for example, while you might commit AI-generated code to your codebase, you ensure that it is correct and follows your preferred style).
Content farms will push out gibberish, but they did so (and worse) before, and the first generation of models was able to train on the internet anyway.
I think it's pretty much a problem, and it's going to ruin any chance of high-quality, original content.
Look at original internet content and what SEO has done to it. Google, and search results in general, are trash nowadays. This is what genAI is going to do over the long term: garbage in, garbage out.
You'd think we'd be concerned about it poisoning the culture well before we worried that it might start to interfere with the rich continuing to profit from that poisoning.
I think it's interesting that human minds generally (though not always!) improve when exposed to the output of other human minds. It seems to be the opposite for current LLMs.
Maybe it's less about "Human VS Robot" and more about exposure to "Original thoughts VS mass-produced average thoughts".
I don't think a human mind would improve in an echo chamber with no new information. I think the reason the human mind improves is that we're exposed to new, original and/or different thoughts that we hadn't considered or come across before.
Meanwhile, an LLM will just regurgitate the most likely token given the previous ones, so there isn't any originality there; hence the output of one LLM cannot improve another LLM. There is nothing new to be learned, basically.
Humans haven't had the same all-encompassing set of "training experiences" that LLMs have. We each have a subset of knowledge that may overlap with some others' knowledge but is largely unique, so when we interact with each other we can learn new things. With LLMs, I imagine it's more like a group of experienced but antiquated professors developing their own set of out-of-touch ideas.
A sequence of AI models trained on each other's output gets mutations, which might help or hurt, but if there's one dominant model at any given time then it's like asexual reproduction with only one living descendant in each generation (and all the competing models being failures to reproduce). A photocopy of a photocopy of a photocopy — this also seems to me to be the incorrect model that Intelligent Design proponents mistakenly think evolution works by.
A huge number of competing models that never rise to dominance would be more like plants spreading pollen in the wind.
A huge number of AIs that are each smart enough to decide what to include in their training sets would be more like animal reproduction. The fittest memes survive.
Memetic mode collapse still happens in individual AIs (it still happens in humans, we're not magic), but that manifests as certain AIs ceasing to be useful and others replacing them economically.
A few mega-minds is a memetic monoculture, fragile in all the same ways as a biological monoculture.
Have you ever heard of the telephone game? That's what is going on here. Or imagine an original story of something that really happened: if it passes through a chain of 100 people, how much do you think it will still resemble the original?
I mean it makes sense that (even impressively functional) statistical approximations would degrade when recursed.
If anything I think this just demonstrates yet again that these aren't actually analogous to what humans think of as "minds", even if they're able to replicate more of the output than makes us comfortable.
Humans exhibit very similar behavior. Prolonged sensory deprivation can drive a single individual insane. Fully isolated/monolithic/connected communities easily become detached from reality and are susceptible to mass psychosis. Etc etc etc. Humans need some minimum amount of external data to keep them in check as well.
I watched someone in the printer room at the computer science department gradually photocopy from white to black, and back again, over the span of 300 pieces of paper, by altering the thresholds of the photocopier.
They didn't graduate to become a computer scientist, but did indeed get admitted to the royal school of art the year after.
I found it strangely therapeutic.
It's fascinating that error can accumulate through repeated training in a way that 1) is undetected by humans and 2) can degrade LLMs or diffusion models (or any transformer model?) so completely. This implies that not only do we not understand how latent knowledge is actually represented in deep nets, we don't know how it forms or how it changes during training. If we did, we could have predicted the destructive impact of recycling output as input. IMO, this suggests we should demand rigorous validation of deep nets (especially generative ones) before relying on them to behave responsibly.
The effect is not new. We have known about it ever since we've had basic machine learning. The way to look at it is somewhat novel but not surprising at all.
I think AI-generated images are a worse problem for training generative image models than AI text is for LLMs, since there are so many of them on the internet now (see Instagram art-related hashtags if you want to see nothing but AI art) compared to the quantity of images scraped prior to 2021 (for those models that trained that way). Text will always be more varied than seeing 10M versions of the same ideas that people make for fun. AI text can also be partial (like AI-assisted writing), but the images will all be essentially 100% generated.
That's far from unique to Instagram. I loathe Stable Diffusion and co solely because they've utterly FLOODED every cool art-adjacent website with endless mediocre derivative shit. Like, there was always low-effort content of course, but holy fuck, there is SO MUCH MORE now. And some of these people are trying to CHARGE for this uninspired junk!!!
There's no such thing as a world model. People do not have world models.
This is a confused term made up by 70s AI researchers, who had the continual problem that they didn't know any philosophy and kept making up their own metaphors for how intelligence might work, and then deciding that because they'd made it up it must be true, and also that if they wrote a computer program that had the same metaphors it must work.
"World model" just vaguely points at something people might do and assumes that if you make up a new thing it vaguely points at it'd help.
I also wonder what search engines are going to do about all this. It sounds to me, actually, like traditional, non-intelligent search might be on its way out, although of course it'll take time. Future search engines will have to be quite adept at figuring out whether the text they index is bullshit or not.
Reminds me of sheep and cows being fed their brethren's own brain matter and developing spongiform encephalopathy (brain disease), or of course cannibals developing kuru. Except a purely 'software' form.
Is there a standard objective metric that can help determine whether the quality of a model has degraded over time? In that case, much like with source code, you could just revert to the old version.
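In spirit it could look something like the sketch below, assuming you have a frozen held-out benchmark you trust (held-out perplexity, FID, task accuracy, whatever fits the model); the names here are illustrative, not any standard tool:

```python
# Treat model versions like source code: evaluate every candidate checkpoint
# on a frozen benchmark and keep serving the old one if the new one regresses.
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class ModelVersion:
    name: str
    score: float  # lower is better, e.g. perplexity on a frozen eval set

def promote_if_better(
    candidate_name: str,
    evaluate: Callable[[str], float],   # stand-in for your trusted metric
    current: Optional[ModelVersion],
    tolerance: float = 0.0,
) -> ModelVersion:
    """Return the version that should serve traffic: the candidate only if it
    does not regress by more than `tolerance`, otherwise the previous one."""
    score = evaluate(candidate_name)
    if current is None or score <= current.score + tolerance:
        return ModelVersion(candidate_name, score)
    return current  # effectively a revert to the older, better checkpoint
```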
I don't remember which YouTuber made an interesting video about it, but basically communities are moving away from the free web into private communities (think Discord, or even sites where you're forced to register just to read the content).
It's an interesting thing, but I think search engine results are becoming worse for this reason too.
I'm not sure how much of a risk this is to LLMs in particular, but I feel like we're already seeing the impact on image AI models.
Even though they're getting better at generating hands that make sense and other fine details, you can generally tell that an image is AI generated because it has a certain "style". Can't help but wonder if this is partly due to generated images contaminating the training data and causing subsequent AI image generators to stylistically converge over time.
It's because the models don't have an optimal aesthetic policy. Which would be difficult, but if they did have one, it wouldn't matter how much bad input data you added during pretraining.
This is a crucial question.
In human society, a feedback loop of nonsense is usually defeated by practical effects in physical reality and experience. The objective of education, for example, is to transmit knowledge and apply reason to important questions.
In manipulated social media, there is no check on the nonsense loop. The technology that we currently call A.I. could be used for educational good.
How it will be used, however, is likely to further distort discourse and generate nonsense.
It is worse because it is faster. How many incorrect blog articles can a single typical writer publish and post on the internet? Maybe 1-2 a day if you are a prolific writer.
How many can an AI agent do? Probably hundreds of thousands a day. To me, that is going to be a huge problem, but I don't have a solution in mind either.
And then those 100K bad articles posted per day by one person are used as training data for the next 100K bad/incorrect articles, and so on, and the problem explodes geometrically.
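A quick back-of-the-envelope using the comment's rates (the writer and agent counts below are purely illustrative assumptions):

```python
# How fast human-written text gets diluted at the rates in the comment.
human_writers = 1_000_000   # assumed number of active human writers
human_rate = 2               # articles per writer per day (comment's figure)
agents = 1_000               # assumed number of AI content farms
agent_rate = 100_000         # articles per agent per day (comment's figure)

human_daily = human_writers * human_rate
agent_daily = agents * agent_rate
print(f"human share of newly posted articles: {human_daily / (human_daily + agent_daily):.1%}")
# Under these assumptions only ~2% of each day's new text is human-written,
# and every later scrape inherits that ratio.
```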
Imagine you have a calculator that outputs a result that is off by one percent. That's AI right now.
If you use the results of each calculation in additional calculations, the result will skew further and further from reality with each error. That's AI training on itself.
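Compounded naively, the comment's one-percent error per step grows geometrically rather than averaging out; a tiny illustration:

```python
# A 1% relative error applied to its own output at every step.
error_per_step = 0.01
for steps in (1, 10, 50, 100):
    drift = (1 + error_per_step) ** steps - 1
    print(f"after {steps:3d} chained steps: ~{drift:.0%} off")
# After 100 chained steps the result is roughly 170% away from the true value
# (assuming the errors all push in the same direction, the worst case).
```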
Precisely.
Whether content is AI-generated, ghostwriter-generated, monkey-on-a-keyboard-generated, etc., presumably it is implicitly filtered by value/quality.
Garbage AI outputs won't be as popular as good AI outputs. (And the same is true of human ones!)
If they are scamming and you contact them, of course they will lie.
So how does this work?
While some people can thrive in that kind of environment (think Kant, for example), many would go crazy.
If you want foom (fast self-improvement in AI), use AIs to filter the training data for the next generation of AIs.
Looks like we didn't learn anything from mad cow disease!