top | item 46578431

(no title)

msp26 | 1 month ago

> because there's already concern that AI models are getting worse. The models are being fed on their own AI slop and synthetic data in an error-magnifying doom-loop known as "model collapse."

Model collapse is a meme that assumes zero agency on the part of the researchers.

I'm unsure how you can have this conclusion when trying any of the new models. In the frontier size bracket we have models like Opus 4.5 that are significantly better at writing code and using tools independently. In the mid tier Gemini 3.0 flash is absurdly good and is crushing the previous baseline for some of my (visual) data extraction projects. And small models are much better overall than they used to be.

discuss

Ifkaluva|1 month ago

The big labs spend a ton of effort on dataset curation.

It goes further than just preventing poison—they do lots of testing on the dataset to find the incremental data that produces best improvements on model performance, and even train proxy models that predict whether data will improve performance or not. “Data Quality” is usually a huge division with a big budget.

stonogo|1 month ago

The common thread from all the frontier orgs is that the datasets are too big to vet, and they're spending lots of money on lobbying to ensure they don't get punished for that. In short, the current corporate stance seems to be that they have zero agency, so which is it?

NewsaHackO|1 month ago

Huh? Unless you are talking about DMCA, I haven't heard about that at all. Most AI companies go to great lengths to prevent exfiltration of copyrighted material.

soulofmischief|1 month ago

Even if it's a meme for the general public, actual ML researchers do have to document, understand and discuss the concept of model collapse in order to avoid it.

ACCount37|1 month ago

It's a meme even if you assume zero agency on the part of the researchers.

So far, every serious inquiry into "does AI contamination in real world scraped data hurt the AI performance" has resulted in things like: "nope", "if it does it's below measurement error" and "seems to help actually?"

biophysboy|1 month ago

Yes, this particular threat seems silly to me. Isn't it a standard thing to rollback databases? If the database gets worse, roll it back and change your data ingestion approach.

jbstack|1 month ago

If you need a strategy to mitigate it (roll back and change approach) then it isn't really fair to describe it as "silly". If it's silly you could just ignore it altogether.

mrtesthah|1 month ago

Coding and reasoning skills can be improved using machine-driven reinforcement learning.

https://arxiv.org/abs/2501.12948

conartist6|1 month ago

Well, they seem to have 0 agency. They left child pornography in the training sets. The people gathering the data committed enormous crimes, wantonly. Science is disintegrating along with public trust in science as fake papers peer reviewed by fake peer reviewers slop along. And from what I hear there has been no more training on the open internet anymore in recent years as it's simply too toxic.