As a machine learning researcher, I don't get why these are called world models.
Visually, they are stunning, but they're nowhere near physically accurate. Look at that video with the girl and the tiger: the tail teleports between its legs and then ends up attached to the girl instead of the tiger.
Just because the visuals are high quality doesn't mean it's a world model or that it has learned physics. I feel like we're conflating those things. I'm much happier to call something a world model if its visual quality is dogshit but it is consistent with its world. And I say its world because it doesn't need to be consistent with ours.
I think the issue is that "world models" are poorly defined.
With this kind of image gen you can sort of plan robot interactions, but it's super slow. I need to find the paper that DeepMind produced, but basically they took the current camera input, used a text prompt like "robot arm picks up the ball", the model generated a video of the arm motion, and then the real robot arm moved the way it did in the video (sketched below).
The problem is that it's not really a world model, it's just image gen. It's not like the model outputs a simulation you can interact with (without generating more video), and it's not like it creates a bunch of rough geometry that you can then run physics on (i.e. you imagine a setup, draw it out, and then run calculations on it).
There is lots of work on making splats editable and semantically labeled, but again it's not like you can run physics on them, so simulation is still very expensive. Also, the properties depend on running the "world model" rather than querying its output at a point in time.
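A rough sketch of that plan-with-video loop, just to make the idea concrete (every object and function name here is a hypothetical placeholder, not the API from the paper):

```python
# Hypothetical sketch of "video generation as robot planner".
# camera, video_model, pose_tracker and robot are placeholder objects.

def plan_with_video_model(camera, video_model, pose_tracker, robot):
    frame = camera.capture()                  # current camera input
    clip = video_model.generate(              # imagine the desired outcome as video
        image=frame,
        prompt="robot arm picks up the ball",
        num_frames=48,
    )
    # Recover an arm trajectory from the generated frames (e.g. by tracking
    # the end effector), then replay it on the real robot.
    trajectory = [pose_tracker.estimate_pose(f) for f in clip]
    for pose in trajectory:
        robot.move_to(pose)
```

The painful part is that every re-plan means generating another video, which is why the whole loop is so slow.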
The input images are stunning; the model's result is another disappointing trip to the uncanny valley. But we feel OK as long as the sequence doesn't horribly contradict the original image or sound. That is the world model.
You just have to extrapolate the improvements in consistency in image models over the last couple of years and apply that to these kinds of video models. When, in a couple of years, they can generate consistent videos of many physical phenomena such that they are nearly indistinguishable from reality, you'll see why they are called "world models".
The tail teleports and reattaches because that is the sort of thing that happens in this special AI world. Even though it looks like a bug, it's actually a physical process being modelled accurately.
I feel like there's a bit of a disconnect between the cool video demos shown here and, say, the type of world models someone like Yann LeCun is talking about.
A proper world model like JEPA should be predicting in latent space, where the representation of what is going on is highly abstract.
Video generation models, by definition, are predicting in noise or pixel space (or in latent noise, if the diffuser is diffusing in a variational autoencoder's latent space). See the sketch at the end of this comment.
It seems like what this lab is doing is quite vanilla, and I'm wondering if they are doing any research in less demo-sexy joint-embedding predictive architectures.
There was a recent paper, LeJEPA, from LeCun and a postdoc, that actually fixes many of the mode/distribution collapse issues with the JEPA embedding models I just mentioned.
I'm waiting on the startup or research group that gives us an unsexy world model. Instead of 1080p video of supermodels camping, give us a slideshow of something a six-year-old would draw. That would be a more convincing demonstration of an effective world model.
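To make the latent-space vs pixel/noise-space distinction concrete, here is a toy sketch of the two training objectives in PyTorch-style code. All the module names (encoder, target_encoder, predictor, denoiser, vae) are placeholders, not any particular published architecture:

```python
import torch
import torch.nn.functional as F

# Toy contrast between the two objectives; every module below is a
# placeholder callable, not a real published model.

def jepa_style_loss(encoder, target_encoder, predictor, frame_t, frame_t1):
    # JEPA-style: predict the *representation* of the next frame, never its pixels.
    z_t = encoder(frame_t)
    with torch.no_grad():
        z_target = target_encoder(frame_t1)   # abstract target (e.g. an EMA encoder)
    return F.mse_loss(predictor(z_t), z_target)

def diffusion_style_loss(denoiser, vae, frame_t1, sigma):
    # Video-diffusion-style: predict noise in pixel space, or in a VAE's
    # latent space as here (noising schedule heavily simplified).
    latent = vae.encode(frame_t1)
    noise = torch.randn_like(latent)
    noisy = latent + sigma * noise
    return F.mse_loss(denoiser(noisy, sigma), noise)
```

The point of the contrast: the first loss never touches pixels at prediction time, so the model can ignore visually salient but physically irrelevant detail; the second is graded on reconstructing appearance.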
The reason they are called "world models" is that the internal representation of what they display represents a "world" rather than a video frame or image. The model needs to "understand" geometry and physics to output a video.
Just because there are errors in this doesn't mean it isn't significant. If a machine learning model understands how physical objects interact with each other that is very useful.
For a minute I was like (spoiler alert) "wow, the creepy sci-fi theories from the DEVS TV show are taking place"… then I looked up the video and it's just video generation at this point.
I guess this might be a chance to plug the fact that Matrix came up with their own metaverse thing (for lack of a better word) called Third Room. It represented the rooms you joined as spaces/worlds, and they built some limited-functionality demos before the funding dried up.
Given the near-impossibility of predicting something as "simple" as the stock market, due to its recursive nature, I'm not sure I see how it would be possible to simulate an infinitely more complicated "world".
I'm building a metasim in full 3D with physics. I just keep running into the limitations of the video format, but it is amazing when done right. My other big concern is licensing of the output.
This looks interesting, but can someone explain to me how this is different from video generators that use the previous frames as input to generate the next frame?
See the demo on their homepage. Calling it a world simulator is a marketing gimmick. It's a worse video generator, but you can interact with it in real time and direct the video a little bit. The next version of this thing will be worth looking at; this one isn't.
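As far as I can tell, the difference is just the conditioning: a plain video generator rolls forward from previous frames, while this kind of interactive model also conditions each step on a live user action. A minimal sketch, assuming a hypothetical predict_next_frame method (not the product's actual API):

```python
# Minimal sketch of the difference; predict_next_frame is hypothetical.

def generate_video(model, first_frame, num_frames):
    frames = [first_frame]
    for _ in range(num_frames):
        # Plain autoregressive video generation: the next frame depends
        # only on the frames generated so far.
        frames.append(model.predict_next_frame(context=frames))
    return frames

def generate_interactive(model, first_frame, get_user_action, num_frames):
    frames = [first_frame]
    for _ in range(num_frames):
        # "World model" style rollout: the same loop, but each step is also
        # conditioned on a live user input (camera move, key press, ...).
        action = get_user_action()
        frames.append(model.predict_next_frame(context=frames, action=action))
    return frames
```

Whether that extra conditioning deserves the name "world model" is exactly what's being argued about here.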
Interesting. I imagine quite a few issues stem from the inherent nature of generative AI, and we even see several in these demos. One particularly stood out to me: the one where the man is submerged, and for a good while bubbles come quite consistently out of his mask, and then suddenly one of the bubbles turns into a jellyfish. At a specific frame the AI decided it looked more like a jellyfish than a bubble, and now the world has a jellyfish to deal with.
It'll surely take a lot of video data, even more than humans can possibly produce, to build a normalized, Euclidean, physics-adherent world model. Data could be synthetically generated, checked thoroughly, and fed into the training process, but at the end of the day it seems... wasteful. As if we're looking at a local optimum.
IAmGraydon|2 months ago
It's called "world models" because it's a grift. An out-in-the-open, shameless grift. Investors, pile on.
jstanley|2 months ago
I don't see that this follows "by definition" at all.
Just because your output is pixel values doesn't mean your internal world model is in pixel space.
superb_dev|2 months ago
I was expecting them to test a simple hypothesis and compare the model's results to a real-world test.
godelski|2 months ago
I'm unconvinced. The tiger and girl video is the clearest example. Nothing about that seems world-representing.
slashdave|2 months ago
No it doesn't. It merely needs to mimic.
pedalpete|2 months ago
Is this more than recursive video? If so, how?