top | item 46523677


zwnow | 1 month ago

I also love how AI enthusiasts just ignore the issue of exhausted training data... You can't just magically create more training data. Also, synthetic training data reduces the quality of models.


aspenmartin | 1 month ago

You're mixing up several concepts. Synthetic data works for coding because coding is a verifiable domain. You train via reinforcement learning to reward code generation behavior that passes detailed specs and meets other desiderata. It's literally how things are done today and how progress gets made.
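The "verifiable domain" point above can be sketched concretely: because code can be executed against spec tests, you get an objective pass/fail reward signal for free. A minimal illustration (the `solve` candidate and the spec cases are hypothetical stand-ins, not any lab's actual training setup):

```python
# Sketch of a "verifiable reward" for code generation: execute a
# model-generated function against spec cases and reward pass/fail.

def reward(candidate_src: str, spec_cases) -> float:
    """Return 1.0 if the generated code passes every spec case, else 0.0."""
    namespace = {}
    try:
        exec(candidate_src, namespace)   # load the generated function
        fn = namespace["solve"]          # hypothetical required entry point
        passed = all(fn(*args) == expected for args, expected in spec_cases)
        return 1.0 if passed else 0.0
    except Exception:
        return 0.0                       # crashes and bad syntax count as failures

# A model-produced candidate and the verifiable spec it must satisfy.
generated = "def solve(a, b):\n    return a + b\n"
spec = [((1, 2), 3), ((0, 0), 0), ((-1, 1), 0)]
print(reward(generated, spec))  # 1.0
```

In a real RL pipeline this scalar would feed a policy-gradient update and the sandboxing would be far more careful, but the key property is the same: the reward is checked by running the code, not by comparing it to existing human-written text.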

zwnow | 1 month ago

Most code out there is a legacy security nightmare, surely it's good to train on that.

TeMPOraL | 1 month ago

They don't ignore it, they just know it's not an actual problem.

It saddens me to see AI detractors being stuck in 2022 and still thinking language models are just regurgitating bits of training data.

zwnow | 1 month ago

You are thankfully wrong. I watch lots of talks on the topic from actual experts. New models are just old models with more tooling. Training data is exhausted and it's a real issue.

puchatek | 1 month ago

That's been my main argument for why LLMs might be at their zenith. But I recently started wondering whether all those codebases we expose to them are maybe good enough training data for the next generation. It's not high quality like accepted Stack Overflow answers, but it's working software for the most part.

jacquesm | 1 month ago

If they were good enough, you could rent them out to put together closed-source stuff you could hide behind a paywall, or maybe the AI owners would also own the paywall and rent you the software instead. The second that's possible, it will happen.