grajaganDev|1 year ago
joshka|1 year ago
The only reason to believe that statement would be that training data is finite and cannot be meaningfully synthetically generated in a way that is useful to the model.
If you can agree that there are certain things which can be objectively measured by deterministic logic (e.g. "does this build", "what is the cyclomatic complexity of this", "does this pass the unit tests", "what is the performance characteristic of this", "can this be proven to be susceptible to an XSS bug", ...), and you can see that there are ways to feed this information back into the models, then there's no reason to think that the available training data is finite and limited by unclean generated data.
There are several missing steps in that logic that would be difficult to (linguistically) prove with certainty, but I'm reasonably sure that your statement is false.
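As a rough illustration of the kind of deterministic feedback signal described above, here is a minimal Python sketch that scores a generated code sample on whether it compiles, how complex it is, and whether it passes its paired tests. The function name score_sample, the weighting, and the use of radon and pytest are illustrative assumptions, not a description of any particular lab's pipeline.

    import subprocess
    import tempfile
    from pathlib import Path

    from radon.complexity import cc_visit  # third-party cyclomatic-complexity library


    def score_sample(source: str, test_source: str) -> float:
        """Crude 0..1 quality score for a generated Python module (illustrative only)."""
        # "Does this build?" -- for Python, does it at least compile?
        try:
            compile(source, "<generated>", "exec")
        except SyntaxError:
            return 0.0

        # "What is the cyclomatic complexity of this?" -- penalize tangled code.
        max_cc = max((block.complexity for block in cc_visit(source)), default=1)
        complexity_score = 1.0 if max_cc <= 10 else 10.0 / max_cc

        # "Does this pass the unit tests?" -- run the paired tests in isolation.
        with tempfile.TemporaryDirectory() as tmp:
            Path(tmp, "generated.py").write_text(source)
            Path(tmp, "test_generated.py").write_text(test_source)
            tests_pass = subprocess.run(["pytest", "-q", tmp]).returncode == 0

        # Combine into a single signal that could gate or weight the sample
        # before it is fed back into a synthetic training set.
        return (0.5 if tests_pass else 0.0) + 0.5 * complexity_score

A score like this could then be used to filter or reweight synthetic samples before training, which is one plausible way the "deterministic checks as feedback" idea could be wired up in practice.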
cadamsdotcom|1 year ago
Synthetic data is doing wonders for models like Phi-4, and at least part of the dataset for DeepSeek-R1 came from their earlier models.
If you read the literature from the Phi-4 team, it talks about synthetic data allowing better control over the training process. The upfront investment is greater but pays off over multiple generations of trained models - and doesn’t leave you with SolidGoldMagikarp ;)
https://www.lesswrong.com/posts/aPeJE8bSo6rAFoLqg/solidgoldm...
jasonhanley|1 year ago
Once humans learn enough, they are able to start coming up with and evaluating their own ideas.
This ability isn't 100% apparent with current public AI models, but I strongly suspect that this is happening behind the scenes.
Certainly researchers are already using AI extensively to improve AI, and that really has the potential to go exponential.
ilaksh|1 year ago
The best models are not language models but multimodal. The grounding of language in video data, new model architectures, and larger models will improve the robustness.
That HeyGen video does not suck. It's actually kind of hard to even tell it's AI if you are only looking at it for a few seconds.
The interesting thing about comparing a human's learning to an AI's is that AI skills and knowledge can be copied basically infinitely, whereas a human is one of a kind.
I imagine some parents are putting in effort with the goal of raising the most productive member of society they possibly can. AI teams have somewhat similar goals for the models they are training.
We could see AI take control of the planet within the next four years in order to end WWIII. We should just hope that they keep lots of us around in giant people zoos.
jasonhanley|1 year ago
I agree the multimodal stuff is amazing. I'm seriously impressed with the new Gemini 2.0 family of models and can't wait until the full multimodal capabilities are in general release.
In terms of the HeyGen vid, it's passable, but that was something I literally whipped up in 10 minutes. You can make ones that are much, much better if you invest in creating better training material. The voice and video model in this case only used the one 3-minute source video.
Funny you mention the "people zoo" thing. That's actually part of a sci-fi story I've been trying to write since I was in my teens. Roughed out here: https://youtu.be/2KLdaVs_ugw