I also love how AI enthusiasts just ignore the issue of exhausted training data... You cant just magically create more training data. Also synthetic training data reduces the quality of models.
Youre mixing up several concepts. Synthetic data works for coding because coding is a verifiable domain. You train via reinforcement learning to reward code generation behavior that passes detailed specs and meets other deseridata. It’s literally how things are done today and how progress gets made.
You are thankfully wrong. I watch lots of talks on the topic from actual experts. New models are just old models with more tooling. Training data is exhausted and its a real issue.
That's been my main argument for why LLMs might be at their zenith. But I recently started wondering whether all those codebases we expose to them are maybe good enough training data for the next generation. It's not high quality like accepted stackoverflow answers but it's working software for the most part.
If they'd be good enough you could rent them to put together closed source stuff you can hide behind a paywall, or maybe the AI owners would also own the paywall and rent you the software instead. The second that that is possible it will happen.
aspenmartin|1 month ago
zwnow|1 month ago
TeMPOraL|1 month ago
It saddens me to see AI detractors being stuck in 2022 and still thinking language models are just regurgitating bits of training data.
zwnow|1 month ago
puchatek|1 month ago
jacquesm|1 month ago