How about humans in solitary confinement - do they also go MAD? Or the "brain in a vat" scenario? It is unreasonable to train a model on its own outputs for many iterations.
How does that work for image generation AI? And even with LLMs the prompts are going to contain extremely limited information because the users are primarily requesting information, not providing it.
Perhaps it means cultures that look inward and constantly recycle their own tropes are more likely to be unstable than cultures that look outwards and have a positive attitude to novelty.
It's more like feeding humans to humans or cows to cows. They go mad too. Feeding cows to humans instead typically works - of course there are ALWAYS exceptions, and it's never perfect or without harms. Both with AI and cows.
It reads to me like locking yourself (or an AI) up in a room with only yourself to converse with causes you to go MAD / lose touch with reality.
The problem isn't AI-generated stuff; the problem is the model's own AI-generated stuff. If different models cover domains the current model doesn't handle particularly well, then there is plenty of merit in exchanging information between these models.
A better comparison would be humans consuming stuff they created themselves. Or, perhaps more sickeningly, an author who only reads what they themselves have written.
Perhaps this is vaguely adjacent to a slow motion stream of self reflection, where in the absence of additional (sensory) input, a descent into hallucination is all but inevitable. The same thing happens in a sensory deprivation tank.
It's articles like this that make me think it's plausible we are a small N iterations away from stumbling into a transformer (or newfangled-thing) architecture that rivals the reasoning capacity of humans.
Indeed. Such closed-loop dynamics turn the AI into a sort of iterated function system, which gravitates to basins of attraction or exhibits chaotic behaviour.
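To make that concrete, here's a toy sketch (my own construction, nothing from the paper): treat each generation as refitting a Gaussian to samples drawn from the previous generation's Gaussian. The finite sample is the iterated "function", and the fully self-consuming loop drifts toward a degenerate attractor where diversity collapses.

    import numpy as np

    rng = np.random.default_rng(0)
    mu, sigma = 0.0, 1.0   # generation 0: the "real" distribution
    n = 50                 # samples per generation; finite sampling drives the drift

    for t in range(1, 301):
        synthetic = rng.normal(mu, sigma, n)           # sample from the current model
        mu, sigma = synthetic.mean(), synthetic.std()  # "retrain" by refitting on the samples
        if t % 50 == 0:
            print(f"gen {t:3d}: mu={mu:+.3f}  sigma={sigma:.3f}")

    # sigma tends toward 0 over the generations and mu freezes at a fairly
    # arbitrary point: each individual refit looks locally reasonable, but the
    # closed loop loses diversity (the MAD regime).

Whether the loop shrinks onto a point, wanders, or blows up depends on what each round amplifies, which is exactly the basin-of-attraction vs. chaos distinction.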
There's renewed excitement about synthetic data because of the DeepMind geometry-solving model that appeared yesterday, which was trained on synthetic data. I guess that's why this is being picked up again.
visarga | 2 years ago
> with enough fresh real data, the quality and diversity of the generative models do not degrade over generations
When a model gets deployed and is prompted by people it can get fresh data in the prompt - the prompt itself, RAG material, outputs of tools, human responses to its outputs - and this allows for a form of exploration that can go outside the original scope of the model. If you use the model logs to retrain, it won't collapse.
weare138 | 2 years ago
AI is only as good as its models. We're now flooding the internet with AI-generated content, with no way to determine whether a given piece of content was itself produced by AI, while endlessly shoveling as much content as possible into these models to push up parameter counts and attract investors. It's inevitable that the current generation of popular AI systems will just eat themselves.
_nalply | 2 years ago
I find that catchy and descriptive on more than one level.
unknown | 2 years ago
[deleted]
Almondsetat | 2 years ago
If I give a pig something to eat and he throws up, I'm surely not touching that stuff myself.
TheOtherHobbes | 2 years ago
Whether it's an AI eating its own training data, a human stuck in an empty cell and forced to live with just his/her own thoughts, or a culture believing its own myths and aggressively punishing insiders and outsiders who challenge them - it's all solitary confinement. And we know how that goes.
kvdveer | 2 years ago
Or to put it in your analogy: if I feed my cat milk, it'll throw up. That doesn't mean milk is unfit for human consumption.
visarga | 2 years ago
> If feeding AI generated stuff to train an AI is bad then what does that tell us about humans consuming AI generated stuff?
It's not unqualified bad. It's bad if you do it many times in a loop, without any external inputs. But you can put fresh material right in the prompt, and then it won't suffer from retraining on synthetic data.
If you train a small model from scratch on purely synthetic data, like the Microsoft Phi models, it comes out competent and 5x more efficient than models trained on web text. So there's the flip side: you can do it in a bad way, and you can do it in a good way.
red_admiral | 2 years ago
A pig (or AI) throwing up on an input should increase your priors that it may be bad for humans, but it does not prove the matter either way.
BartjeD | 2 years ago
If you interact with people / fresh data regularly you'll avoid that.
I'm not sure how much is down to evolving requirements and data formats vs degradation of the training data. The idea sounds common-sense, but it's also triggering a bullshit warning for me, because I'd expect more consistency per AI model, and differences to occur with new model iterations - regardless of the dataset.
Charon77 | 2 years ago
Just because it doesn't work with AI now doesn't mean it could never work.
Grimblewald | 2 years ago
There is a difference between talking only to yourself and talking with other people.
_nalply | 2 years ago
We know that there is more than one AI system, but there is not as much variation as among humans.
And as someone else already wrote here, humans have something like a ratcheting mechanism. I would paraphrase it as "Humans have learnt to stand on the shoulders of giants". AI does not have this.
galaxyLogic | 2 years ago
Luckily we don't hear too much about Mad Cow Disease these days. Let's hope it doesn't come back.
red_admiral | 2 years ago
At least AI hasn't reinvented Mutually Assured Destruction!
tysam_and | 2 years ago
This is, among other things, a natural consequence of the equations surrounding Shannon's original noisy-channel capacity theorem, where the noise is (in many ways) conditioned on the structure of the model itself.
It is not necessarily all that surprising from a purely high-level perspective, but I personally think it is good to have the analysis. From a professional standpoint, I do not believe the phenomenon is distinctive enough to need its own separate name for day-to-day use. From a personal perspective, however, I thought the mad cow disease reference was hilarious and applaud whoever came up with the acronym.
I see the benefit in the analysis, and the concern about generated data being present in the training data makes sense to me (in sufficient quantity, it would bias the models improperly in a rather significant way).
I particularly enjoyed the tongue-in-cheek humor of this line:
"Ascertaining whether an autophagous loop has gone MAD or not (recall Definition 2.1) requires that we measure how far the synthesized data distribution Gt has drifted from the true data distribution Pr over the generations t."
I like their use of color in the paper; I saw a similar orange/green color scheme earlier today and enjoyed it very much as an annotation method.
"A fixed real dataset only slows generative model degradation" is again also a natural consequence of Shannon's noisy channel capacity theorem, one can say that with almost nearly perfect certainty that a limited neural network will not be able to perfectly fit the distribution of the data that it is training on, thus it will have bias, variance, or some combination of both, limited ultimately by the model's capacity itself.
With respect to the original dataset, this misfit is noise, and we can choose between collapse or recursively encoding the noise patterns of the previous model (which might turn out to have an additive effect, or maybe not! Who knows! I have not figured this one out myself yet).
As for the real data slowing down degradation: if we are sampling i.i.d., then proportionately we should still see some degradation, as this is the nature of empirical risk minimization over maximum likelihood estimation. It is still good that they have shown this, though, I think.
The fresh-data loop, I believe, is itself a kind of noise with respect to the original input dataset, and as long as this 'noise' (from the model's perspective) has a higher SNR than the (potentially slow) collapse of the model's output distribution, the model should, at least in some proportion, constantly be playing 'keep-up' with the fresh data.
"First, we find that—regardless of the performance of early generations—the performance of later generations converges to a point that depends only on the amounts of real and synthetic data in the training loop. " -- there we are (I saw this after making the SNR point, this makes sense within this framework of interpretation, then.
All in all, I found the paper well aware of itself and what it was studying; it was well laid out and accessible, and while the points are not necessarily earth-shattering (though I still have to read through some of it), having clear empirical evidence about this phenomenon, detailing it, and cutting through a forest of (at least seemingly) untested ground is something I appreciated.
Curious to hear what others think about this one. <3 :'))))