Survey of 65+ papers on model collapse. Key finding from Dohmatob et al. (ICLR 2025): even 0.1% synthetic contamination in training data causes measurable degradation.
No major dataset (FineWeb, RedPajama, C4) currently filters for AI-generated content.
No comments yet.