msp26 | 1 month ago
Model collapse is a meme that assumes zero agency on the part of the researchers.
I'm unsure how you can reach this conclusion after trying any of the new models. In the frontier size bracket we have models like Opus 4.5 that are significantly better at writing code and using tools independently. In the mid tier, Gemini 3.0 Flash is absurdly good and is crushing the previous baseline on some of my (visual) data extraction projects. And small models are much better overall than they used to be.
Ifkaluva | 1 month ago
It goes further than just preventing poisoning: they do extensive testing on the dataset to find the incremental data that produces the best improvements in model performance, and even train proxy models that predict whether a given piece of data will improve performance or not. "Data Quality" is usually a huge division with a big budget.
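The proxy-model idea above can be sketched in miniature. This is a hypothetical toy, not any lab's actual pipeline: it trains a tiny logistic-regression "data value" classifier on hand-crafted features (length and lexical diversity are stand-in signals I chose for illustration), then uses its scores to filter a candidate corpus.

```python
import math

def features(text):
    # Hypothetical quality signals: normalized length and type/token ratio.
    words = text.split()
    length = min(len(words) / 50.0, 1.0)
    diversity = len(set(words)) / max(len(words), 1)
    return [1.0, length, diversity]  # bias term + two features

def train_proxy(samples, labels, lr=0.5, epochs=200):
    """Fit a logistic-regression proxy predicting 'this sample helps'."""
    w = [0.0] * 3
    for _ in range(epochs):
        for text, y in zip(samples, labels):
            x = features(text)
            p = 1.0 / (1.0 + math.exp(-sum(wi * xi for wi, xi in zip(w, x))))
            # SGD step on the log-loss gradient.
            for i in range(len(w)):
                w[i] += lr * (y - p) * x[i]
    return w

def score(w, text):
    z = sum(wi * xi for wi, xi in zip(w, features(text)))
    return 1.0 / (1.0 + math.exp(-z))

# Toy labels: repetitive spam (0) vs varied prose (1).
corpus = [
    ("buy now buy now buy now buy now", 0),
    ("click here click here click here", 0),
    ("the quick brown fox jumps over the lazy dog", 1),
    ("a short essay on the history of computing machines", 1),
]
w = train_proxy([t for t, _ in corpus], [y for _, y in corpus])
keep = [t for t, _ in corpus if score(w, t) > 0.5]
```

In practice the "proxy" would itself be a small language model and the label would come from measuring downstream loss after training on the candidate data, but the shape of the filter is the same: score cheaply, keep only what the proxy predicts will help.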
So far, every serious inquiry into "does AI contamination in real world scraped data hurt the AI performance" has resulted in things like: "nope", "if it does it's below measurement error" and "seems to help actually?"
mrtesthah | 1 month ago
https://arxiv.org/abs/2501.12948