Diffusion for World Modeling

476 points| francoisfleuret | 1 year ago |diamond-wm.github.io | reply

218 comments

[+] smusamashah|1 year ago|reply
This video https://x.com/Sentdex/status/1845146540555243615 looks way too much like my dreams. This is almost exactly what happens when I sometimes try to jump high: it transports me to a different place, just like that. Things keep changing in the same way. It's amazing to see how close it is to a real dream experience.
[+] kleene_op|1 year ago|reply
I noticed that all text looked garbled when I had some lucid dreams. When diffusion models started to gain attention, I made the connection that text in generated images also looks garbled.

Maybe all of those are clues that parts of the human subconscious mind operate pretty close to the principles behind diffusion models.

[+] siavosh|1 year ago|reply
What’s amazing is that if you really start paying attention it seems like the mind is often doing the same thing when you’re awake, less noticeable with your visual field but more noticeable with attention and thoughts themselves.
[+] voidUpdate|1 year ago|reply
It's interesting how much dreams differ from person to person. Mine tend to be completely coherent visually, to the point that I have used Google Maps in my dreams, and while the geography was inaccurate, it was consistent. However, I have never been lucid within a dream; maybe that makes a difference.
[+] jvanderbot|1 year ago|reply
This is why I'm excited in a limited way. Clearly something is disconnected in a dream state that has an analogous disconnect here.

I think these models lack a world model, something with strong spatial reasoning and continuity expectations that animals have.

Of course that's probably learned too.

[+] earnesti|1 year ago|reply
That looks way too much like the one time I did DMT-5
[+] soheil|1 year ago|reply
How are you so sure this is like your dreams? If it was easy to accurately remember dreams why would they be all so smooshy and such a jumbled mess like in this video?
[+] thegabriele|1 year ago|reply
We are unconsciously (pun intended) implementing how brains work in both dream and wake states. Can't wait until we add some kind of (lossless) memory to these models.
[+] francoisfleuret|1 year ago|reply
This is a 300M-parameter model (1/1300th the size of the big Llama 3), trained on 5M frames over 12 days on an RTX 4090.

This is what a big tech company was doing in 2015.

The same stuff at industrial scale à la large LLMs would be absolutely mind blowing.

[+] gjulianm|1 year ago|reply
What exactly would be the benefit of that? We already have Counter-Strike running far more smoothly than this, without wasting tons of compute.
[+] GaggiX|1 year ago|reply
If 12 days with an RTX4090 is all you need, some random people on the Internet will soon start training their own.
[+] cs702|1 year ago|reply
Came here to say pretty much the same thing, and saw your comment.

The rate of progress has been mind-blowing indeed.

We sure live in interesting times!

[+] Sardtok|1 year ago|reply
Two 4090s, but yeah.
[+] marcyb5st|1 year ago|reply
So, this is pretty exciting.

I can see how this could already be used to generate realistic physics approximations in a game engine. You create a bunch of snippets of gameplay using a much heavier and more realistic physics engine (perhaps even CGI). The model learns to approximate the physics and boom, now you have a lightweight physics engine. Perhaps you could even have several that are specialized (e.g. one for smoke dynamics, one for explosions, ...). Even if it hallucinates, it wouldn't be worse than the physics bugs that are so common in games.
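A minimal sketch of that "distill a heavy engine into a cheap surrogate" idea, with numpy standing in for both sides. The drag dynamics and the linear least-squares "model" below are illustrative choices of mine, not anything a real engine or the paper uses:

```python
import numpy as np

rng = np.random.default_rng(0)
DT, DRAG, G = 0.01, 0.1, 9.81

def heavy_step(state):
    """The 'expensive' reference physics: a projectile with linear drag.
    (Affine in the state, so the linear fit below can recover it almost
    exactly; real physics wouldn't be this easy.)"""
    x, y, vx, vy = state
    return np.array([x + vx * DT,
                     y + vy * DT,
                     vx - DRAG * vx * DT,
                     vy - (G + DRAG * vy) * DT])

# Record (state, next_state) snippets from the heavy engine.
states = rng.uniform(-10, 10, size=(5000, 4))
targets = np.array([heavy_step(s) for s in states])

# "Train" the lightweight surrogate: least squares with a bias column,
# standing in for a learned network.
X = np.hstack([states, np.ones((len(states), 1))])
W, *_ = np.linalg.lstsq(X, targets, rcond=None)

def cheap_step(state):
    return np.append(state, 1.0) @ W

# The surrogate now predicts the next state without running the engine.
s = np.array([0.0, 100.0, 30.0, 0.0])
err = np.abs(cheap_step(s) - heavy_step(s)).max()
print(err)  # tiny here, because this toy dynamics is affine
```

A learned network would replace the least-squares fit for nonlinear dynamics like smoke or explosions, but the training loop is the same shape: heavy engine generates pairs, cheap model fits them.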

[+] croo|1 year ago|reply
For anyone who actually tried it :

Does it respect/build some kind of game map in the process, or is it just a bizarre psychedelic dream-walk experience where you cannot go back to the same place twice and spatial dimensions are just funny? Is the game map finite?

[+] InsideOutSanta|1 year ago|reply
Just looking at the first video, there's a section where structures just suddenly appear in front of the player, so this does not appear to build any kind of map, or have any kind of meaningful awareness of something resembling a game state.

This is similar to LLM-based RPGs I've played, where you can pick up a sword and put it in your empty bag, and then pull out a loaf of bread and eat it.

[+] aidos|1 year ago|reply
Just skimmed the article but my guess is that it’s a dream type experience where if you turned around 180 and walked the other direction it wouldn’t correspond to where you just came from. More like an infinite map.
[+] delusional|1 year ago|reply
Just tried it out, and no. It doesn't have any sort of "map" awareness. It's very much in the "recall/replay" category of "AI" where it seems to accurately recall stuff that is part of the training dataset, but as soon as you do something not in there (like walk into a wall), it completely freaks out and spits out gibberish. Plausible gibberish, but gibberish none the less.
[+] jmchambers|1 year ago|reply
I _think_ I understand the basic premise behind stable diffusion, i.e., reverse the denoising process to generate realistic images but, as far as I know, this is always done at the pixel level. Is there any research attempting to do this at the 3D asset level, i.e., subbing in game engine assets (with position and orientation) until a plausible scene is recreated? If it were possible to do it that way, couldn't it "dream" up real maps, with real physics, and so avoid the somewhat noisy output these types of demos generate?
[+] desdenova|1 year ago|reply
I think the closest we have right now is 3D gaussian splatting.

So far it's only been used to train a scene from photographs from multiple angles and rebuild it volumetrically by adjusting densities in a point-cloud.

But it might be possible to train a model on multiple different scenes, and perform diffusion on a random point cloud to generate new scenes.

Rendering a point cloud in real time is also very efficient, so it could be used to create insanely realistic game worlds instead of polygonal geometry.

It seems someone already thought of that: https://ar5iv.labs.arxiv.org/html/2311.11221

[+] furyofantares|1 year ago|reply
> but, as far as I know, this is always done at the pixel level

Image models are NOT denoised at the pixel level - diffusion happens in latent space. This was one of the big breakthroughs that made all of this work well.

There's a model for encoding/decoding between pixels and latent space. Latent space is able to encode whatever concepts it needs in whichever of its dimensions it needs, and is generally lower dimensional than pixel space. So we get a noisy latent space, denoise it using the diffusion model, then use the other model (variational autoencoder) to decode into pixel space.
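To make the shape of that pipeline concrete, here's a toy numpy sketch of the latent-space round trip. Everything in it (the random linear `decode` map, the shrink-toward-zero `denoise_step`) is a hypothetical stand-in for the learned VAE and denoiser, not the real Stable Diffusion architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions: a 64x64 "image" compressed to a 16-dim latent.
PIXEL_DIM, LATENT_DIM, STEPS = 64 * 64, 16, 50

# Stand-in for the VAE decoder: a fixed random linear map.
# A real VAE is a learned nonlinear network; this is a placeholder.
decode = rng.normal(size=(PIXEL_DIM, LATENT_DIM)) / np.sqrt(LATENT_DIM)

def denoise_step(z, t):
    """Placeholder for the learned denoiser: just shrinks the latent
    toward zero, mimicking how each step removes a little noise."""
    return z * (1.0 - 1.0 / (STEPS - t + 1))

# Start from pure noise in *latent* space, not pixel space...
z = rng.normal(size=LATENT_DIM)
for t in range(STEPS):
    z = denoise_step(z, t)

# ...and only decode to pixels once, at the very end.
image = decode @ z
print(image.shape)  # (4096,) -- one flat 64x64 image
```

The point of the structure: the expensive iterative loop runs over 16 numbers, not 4096, and the pixel-space decode happens exactly once.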

[+] jampekka|1 year ago|reply
Not exactly 3D assets, but diffusion models are used to generate e.g. traffic (vehicle trajectories) for evaluating autonomous-vehicle algorithms. These vehicles tend to crash quite a lot.

For example https://github.com/NVlabs/CTG

Edit: fixed link

[+] tiborsaas|1 year ago|reply
Generating this at the pixel level is the next-level thing. The reverse-engineering method you described is probably appealing because it's easier to understand.

Focusing on pixel-level generation is the right approach, I think. The somewhat noisy output will probably be improved in a short timeframe. Now that they've proved with Doom (https://gamengen.github.io/) and with this that it's possible, more research is probably happening right now to nail the architecture needed to scale this to HD with minimal hallucination. It already happened with video, so we should see a similar breakthrough soon.

[+] gliptic|1 year ago|reply
> I _think_ I understand the basic premise behind stable diffusion, i.e., reverse the denoising process to generate realistic images but, as far as I know, this is always done at the pixel level.

It's typically not done at the pixel level, but at the "latent space" level of e.g. a VAE. The image generation is done in this space, which has fewer outputs than the pixels of the final image, and then converted to the pixels using the VAE.

[+] slashdave|1 year ago|reply
Stable diffusion is in latent space, not by pixel.
[+] cousin_it|1 year ago|reply
I continue to be puzzled by people who don't notice the "noise of hell" in NN pictures and videos. To me it's always recognizable and terrifying, has been from the start.
[+] npteljes|1 year ago|reply
What do you mean by noise of hell in particular? I do notice that the images are almost always uncanny in a way, but maybe we're not meaning the same thing. Could you elaborate on what you experience?
[+] taneq|1 year ago|reply
Like a subtle but unsettling babble/hubbub/cacophony? If so then I think I kind of know what you mean.
[+] HKH2|1 year ago|reply
Eyes have a lot of noise too.
[+] delusional|1 year ago|reply
I just checked it out right quick. It works perfectly well on an AMD card with ROCM pytorch.

It seems decent in short bursts. As it goes on it quite quickly loses detail and the weapon has a tendency to devolve into colorful garbage. I would also like to point out that none of the videos show what happens when you walk into a wall. It doesn't handle it very gracefully.

[+] DrSiemer|1 year ago|reply
Where it gets really interesting is if we can train a model on the latest GTA, plus maybe related real life footage, and then use it to live upgrade the visuals of an old game like Vice City.

The lack of temporal consistency will still make it feel pretty dreamlike, but it won't matter that much, because the base is consistent and it will look amazing.

[+] InsideOutSanta|1 year ago|reply
Just redrawing images drawn by an existing game engine works, and generates amazing results, although like you point out, temporal consistency is not great. It might interpret the low-res green pixels on a far-away mountain as fruit trees in one frame, and as pines in the next.

Here's a demo from 2021 doing something like that: https://www.youtube.com/watch?v=3rYosbwXm1w

[+] davedx|1 year ago|reply
A game like GTA has way too much functionality and complex branching for this to work I think (beyond eg doing aimless drives around the city — which would be very cool though)
[+] empath75|1 year ago|reply
People focusing on the use of this in video games baffles me. The point isn't that it can regenerate a videogame world, the point is that it can simulate the _real world_. They're using video game footage to train it because it's cheap and easy to synthesize the data they need. This system doesn't know it's simulating a game. You can give it thousands or millions of hours of real world footage and agent input and get a simulation of the real world.
[+] taneq|1 year ago|reply
Using it as a visual upgrade is pretty close to what DLSS does so that sounds plausible.
[+] skydhash|1 year ago|reply
Why not just create the assets at a higher resolution?
[+] mungoman2|1 year ago|reply
This is getting ridiculous!

Curious, since this is a tight loop (old frame + input -> new frame): what happens if a non-CS image is used to start it off? Or a map the model has never seen? Will the model play ball, or will it drift back to known CS maps?
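As a toy illustration of that feedback loop (the `next_frame` blend below is a made-up stand-in for the diffusion model, not anything from the paper), you can see why an out-of-distribution seed frame tends to get washed out and pulled back toward what the model "knows":

```python
import numpy as np

rng = np.random.default_rng(1)
H, W = 8, 8  # tiny "frame" for illustration

def next_frame(frame, action):
    """Stand-in for the world model: blend the previous frame with a
    pattern derived from the action. The real model denoises a latent
    conditioned on (past frames, player input)."""
    pattern = np.full((H, W), float(action))
    return 0.9 * frame + 0.1 * pattern

# Seed the loop with an out-of-distribution image (here: pure noise),
# then keep feeding each output back in as the next input.
frame = rng.normal(size=(H, W))
for step in range(100):
    frame = next_frame(frame, action=1)

# After many iterations, only a ~0.9^100 trace of the seed survives:
# the loop has "drifted back" to the pattern the model produces.
print(frame.mean())  # close to 1.0
```

In the real system the attractor isn't a flat pattern but the training distribution, i.e. Dust 2, which is why unseen start frames would likely morph into familiar map geometry within a few seconds.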

[+] Arch-TK|1 year ago|reply
Looks like it only knows Dust 2 since every single "dream" (I'm going to call them that since looking at this stuff feels like dreaming about Dust 2) is of that map only.
[+] ilaksh|1 year ago|reply
I wonder if there is some way to combine this with a language model, or somehow have the language model in the same latent space or something.

Is that what vision-language models already do? Somehow all of the language should be grounded in the world model. For models like Gemini that can answer questions about video, it must have some level of this grounding already.

I don't understand how this stuff works, but compressing everything to one dimension as in a language model for processing seems inefficient. The reason our language is serial is because we can only make one sound at a time.

But suppose the "game" trained on was a structural engineering tool. The user asks about some scenario for a structure and somehow that language is converted to an input visualization of the "game state". Maybe some constraints to be solved for are encoded also somehow as part of that initial state.

Then when it's solved (by an agent trained through reinforcement learning that uses each dreamed game state as input?), the result "game state" is converted somehow back into language and combined with the original user query to provide an answer.

But if I understand properly, the biggest utility of this is that there is a network that understands how the world works, and that part of the network can be utilized for predicting useful actions or maybe answering questions etc. ?

[+] fancyfredbot|1 year ago|reply
Strangely, the paper doesn't seem to give much detail on the CS:GO example. Actually, the paper explicitly mentions it's limited to discrete control environments. Unless I'm missing something, the mouse input for Counter-Strike isn't discrete and wouldn't work.

I'm not sure why the title says it was trained on 2x4090 either as I can't see this on either the linked page or the paper. The paper mentions a GPU year of 4090 compute was used to train the Atari model.

[+] akomtu|1 year ago|reply
The current batch of ML models looks a lot like filling in holes in a wall of text, drawings, or movies: you erase a part of the wall and tell it to fix it. It fills in the hole using colors from the nearby walls in the kitchen and from similar walls, and we watch in awe, thinking it must have figured out the design rules of the kitchen. What it has really done, though, is interpolate the gaps with some sort of basis functions, trigonometric polynomials for example, and it used thousands of them. This solution wouldn't occur to us because our limited memory isn't enough for thousands of polynomials: we have to find a compact set of rules or give up entirely. So when these ML models predict the motion of planets, they approximate Newton's law with a long series of basis functions.
[+] ThouYS|1 year ago|reply
I don't really understand the intuition on why this helps RL. The original game has a lot more detail, why can't it be used directly?
[+] shahzaibmushtaq|1 year ago|reply
Having played CS 1.6 and CS:GO in my free time before the pandemic, it looks to me like this playable CS diffusion world was trained on footage from a noob player for research purposes.

After reading the comments I can assume that if you play outside of the scope it was trained on, the game loses its functionality.

Nevertheless, R&D for a good cause is something we all admire.

[+] thenthenthen|1 year ago|reply
When my game starts to look like this, I know it's time to quit, hahha. Maybe a helpful tool in gaming-addiction therapy? The morphing of the gun/skins and the environment (the sandbags), wow. I would like to play this and see what happens when you walk backwards, turn around quickly, or use 'noclip' :D
[+] Zealotux|1 year ago|reply
Could we imagine parts of game elements to become "targets" for models? For example hair and fur physics have been notoriously difficult to nail, but it should be easier to use AI to simulate some fake physics on top of the rendered frame, right? Is anyone working on that?
[+] LarsDu88|1 year ago|reply
Iterative denoising diffusion is such a hurdle for getting this sort of thing running at reasonable fps
[+] advael|1 year ago|reply
Dang this is the first paper I've seen in a while that makes me think I need new GPUs
[+] madaxe_again|1 year ago|reply
I earnestly think this is where all gaming will go in the next five years - it’s going to be so compelling that stuff already under development will likely see a shift to using diffusion models. As this is demonstrating, a sufficiently honed model can produce realtime graphics - and some of the demos floating around where people are running GTA San Andreas through non-realtime models hint as to where this will go.

I give it the same five years before there are games entirely indistinguishable from reality, and I don’t just mean graphical fidelity - there’s no reason that the same or another model couldn’t provide limitless physics - bust a hole through that wall, set fire to this refrigerator, whatever.