This video https://x.com/Sentdex/status/1845146540555243615 looks way too much like my dreams. This is almost exactly what happens when I sometimes try to jump high: it transports me to a different place just like that, and things keep changing the same way. It's amazing to see how close it is to a real dream experience.
I noticed that all text looked garbled when I had some lucid dreams. When diffusion models started to gain attention, I made the connection that text in generated images looks garbled in the same way.
Maybe all of those are clues that parts of the human subconscious mind operate pretty close to the principles behind diffusion models.
What’s amazing is that if you really start paying attention, it seems like the mind is often doing the same thing when you’re awake: less noticeable in your visual field, more noticeable in attention and thoughts themselves.
It's interesting how much dreams differ from person to person. Mine tend to be completely coherent visually, to the point that I have used Google Maps in my dreams, and while the geography was inaccurate, it was consistent. However, I have never been lucid within a dream; maybe that makes a difference.
How are you so sure this is like your dreams? If it was easy to accurately remember dreams why would they be all so smooshy and such a jumbled mess like in this video?
We are unconsciously (pun intended) implementing how brains work both in dream and wake states.
Can't wait until we add some kind of (lossless) memory to these models.
I can see how this could already be used to generate realistic physics approximations in a game engine. You create a bunch of snippets of gameplay using a much heavier, more realistic physics engine (perhaps even CGI). The model learns to approximate the physics and boom, now you have a lightweight physics engine. Perhaps you could even have several specialized ones (e.g. one for smoke dynamics, one for explosions, ...). Even if it hallucinates, it wouldn't be worse than the physics bugs that are so common in games.
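The surrogate-engine idea can be sketched in a few lines. Purely illustrative: the "heavy engine" here is just a damped-spring integrator and the lightweight surrogate is a least-squares fit; a real setup would train a neural network on rendered gameplay snippets, and every name below is hypothetical.

```python
import numpy as np

# "Heavy" engine: one exact Euler step of a damped harmonic oscillator.
def heavy_step(state, dt=0.01, k=4.0, c=0.3):
    x, v = state
    a = -k * x - c * v
    return np.array([x + v * dt, v + a * dt])

# Collect gameplay "snippets": (state, next_state) pairs from the heavy engine.
rng = np.random.default_rng(0)
states = rng.uniform(-1, 1, size=(1000, 2))
nexts = np.array([heavy_step(s) for s in states])

# Fit a lightweight surrogate (least squares here; a game would use a small NN).
A, *_ = np.linalg.lstsq(states, nexts, rcond=None)

def light_step(state):
    return state @ A

# The surrogate should closely match the heavy engine on unseen states.
s = np.array([0.5, -0.2])
print(np.abs(light_step(s) - heavy_step(s)).max())
```

Because this toy dynamics is linear, the fit recovers it almost exactly; the interesting (and harder) part in a real engine is nonlinear dynamics like smoke or debris.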
Does it respect/build some kind of game map in the process, or is it just a bizarre psychedelic dream-walk experience where you cannot go back to the same place twice and the spatial dimensions are just funny? Is a game map finite?
Just looking at the first video, there's a section where structures just suddenly appear in front of the player, so this does not appear to build any kind of map, or have any kind of meaningful awareness of something resembling a game state.
This is similar to LLM-based RPGs I've played, where you can pick up a sword and put it in your empty bag, and then pull out a loaf of bread and eat it.
Just skimmed the article but my guess is that it’s a dream type experience where if you turned around 180 and walked the other direction it wouldn’t correspond to where you just came from. More like an infinite map.
Just tried it out, and no. It doesn't have any sort of "map" awareness. It's very much in the "recall/replay" category of "AI" where it seems to accurately recall stuff that is part of the training dataset, but as soon as you do something not in there (like walk into a wall), it completely freaks out and spits out gibberish. Plausible gibberish, but gibberish nonetheless.
I _think_ I understand the basic premise behind stable diffusion, i.e., reverse the denoising process to generate realistic images but, as far as I know, this is always done at the pixel level. Is there any research attempting to do this at the 3D asset level, i.e., subbing in game engine assets (with position and orientation) until a plausible scene is recreated? If it were possible to do it that way, couldn't it "dream" up real maps, with real physics, and so avoid the somewhat noisy output these types of demo generate?
I think the closest we have right now is 3D gaussian splatting.
So far it's only been used to train a scene from photographs from multiple angles and rebuild it volumetrically by adjusting densities in a point-cloud.
But it might be possible to train a model on multiple different scenes, and perform diffusion on a random point cloud to generate new scenes.
Rendering a point cloud in real time is also very efficient, so it could be used to create insanely realistic game worlds instead of polygonal geometry.
> but, as far as I know, this is always done at the pixel level
Image models are NOT denoised at the pixel level - diffusion happens in latent space. This was one of the big breakthroughs that made all of this work well.
There's a model for encoding/decoding between pixels and latent space. Latent space is able to encode whatever concepts it needs in whichever of its dimensions it needs, and is generally lower dimensional than pixel space. So we get a noisy latent space, denoise it using the diffusion model, then use the other model (variational autoencoder) to decode into pixel space.
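A toy sketch of that data flow, with stand-in functions for the learned networks. Everything here is hypothetical (the real denoiser and VAE are trained models); it only shows where latent space sits in the pipeline: denoise a small latent vector iteratively, then decode to pixels at the very end.

```python
import numpy as np

rng = np.random.default_rng(0)

LATENT_DIM = 4        # latent space is much smaller than pixel space
IMAGE_SHAPE = (8, 8)  # 64 "pixels"

def denoise_step(z, t):
    # A real diffusion model predicts the noise to remove at step t;
    # here we just shrink toward zero to illustrate the iterative loop.
    return z * 0.9

def vae_decode(z):
    # A real VAE decoder maps latents to pixels; here: a random projection.
    W = rng.standard_normal((np.prod(IMAGE_SHAPE), LATENT_DIM))
    return (W @ z).reshape(IMAGE_SHAPE)

# Start from pure noise in *latent* space, not pixel space.
z = rng.standard_normal(LATENT_DIM)
for t in reversed(range(50)):  # iterative denoising
    z = denoise_step(z, t)

image = vae_decode(z)          # only now do we touch pixels
print(image.shape)
```

The point of the structure: the expensive iterative loop runs over a 4-dimensional vector, and the single decode step is the only thing that touches the 64 pixel values.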
Not exactly 3D assets, but diffusion models are used to generate e.g. traffic (vehicle trajectories) for evaluating autonomous vehicle algorithms. These vehicles tend to crash quite a lot.
Generating this at the pixel level is the next-level thing. The reverse-engineering method you described is probably appealing because it's easier to understand.
Focusing on pixel-level generation is the right approach, I think. The somewhat noisy output will probably be improved upon in a short timeframe. Now that they proved with Doom (https://gamengen.github.io/) and this that it's possible, more research is probably happening right now to nail the correct architecture to scale this to HD with minimal hallucination. It happened with videos already, so we should see a similar breakthrough soon.
> I _think_ I understand the basic premise behind stable diffusion, i.e., reverse the denoising process to generate realistic images but, as far as I know, this is always done at the pixel level.
It's typically not done at the pixel level, but at the "latent space" level of e.g. a VAE. The image generation is done in this space, which has fewer outputs than the pixels of the final image, and then converted to the pixels using the VAE.
I continue to be puzzled by people who don't notice the "noise of hell" in NN pictures and videos. To me it's always recognizable and terrifying, has been from the start.
What do you mean by noise of hell in particular? I do notice that the images are almost always uncanny in a way, but maybe we're not meaning the same thing. Could you elaborate on what you experience?
I just checked it out right quick. It works perfectly well on an AMD card with ROCm PyTorch.
It seems decent in short bursts. As it goes on it quite quickly loses detail and the weapon has a tendency to devolve into colorful garbage. I would also like to point out that none of the videos show what happens when you walk into a wall. It doesn't handle it very gracefully.
Where it gets really interesting is if we can train a model on the latest GTA, plus maybe related real life footage, and then use it to live upgrade the visuals of an old game like Vice City.
The lack of temporal consistency will still make it feel pretty dreamlike, but it won't matter that much, because the base is consistent and it will look amazing.
Just redrawing images drawn by an existing game engine works, and generates amazing results, although like you point out, temporal consistency is not great. It might interpret the low-res green pixels on a far-away mountain as fruit trees in one frame, and as pines in the next.
A game like GTA has way too much functionality and complex branching for this to work I think (beyond eg doing aimless drives around the city — which would be very cool though)
People focusing on the use of this in video games baffles me. The point isn't that it can regenerate a videogame world, the point is that it can simulate the _real world_. They're using video game footage to train it because it's cheap and easy to synthesize the data they need. This system doesn't know it's simulating a game. You can give it thousands or millions of hours of real world footage and agent input and get a simulation of the real world.
Curious: since this is a tight loop of old frame + input -> new frame, what happens if a non-CS image is used to start it off? Or a map the model has never seen? Will the model play ball, or will it drift back to known CS maps?
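The loop being asked about can be sketched as follows. The `toy_model` is a made-up stand-in (the real model is a diffusion network conditioned on recent frames and actions); its pull toward a "training mean" loosely mimics the drift-back-to-known-content behavior.

```python
# Minimal sketch of the autoregressive frame loop: each new frame is
# predicted from the previous frame plus the player's input.
def rollout(model, start_frame, actions):
    frames = [start_frame]
    for a in actions:
        frames.append(model(frames[-1], a))
    return frames

# Toy "model": drifts every value toward the training mean, illustrating
# why an off-distribution start frame gets pulled back toward known data.
TRAIN_MEAN = 0.5
toy_model = lambda frame, action: [(p + TRAIN_MEAN) / 2 for p in frame]

frames = rollout(toy_model, [0.0, 1.0], actions=range(10))
print(frames[-1])  # both values have converged toward 0.5
```

Note also that errors compound: whatever the model hallucinates in one frame becomes the ground truth for the next, which is why long rollouts degrade.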
Looks like it only knows Dust 2 since every single "dream" (I'm going to call them that since looking at this stuff feels like dreaming about Dust 2) is of that map only.
I wonder if there is some way to combine this with a language model, or somehow have the language model in the same latent space or something.
Isn't that what vision-language models already do? Somehow all of the language should be grounded in the world model. For models like Gemini that can answer questions about video, it must have some level of this grounding already.
I don't understand how this stuff works, but compressing everything to one dimension as in a language model for processing seems inefficient. The reason our language is serial is because we can only make one sound at a time.
But suppose the "game" trained on was a structural engineering tool. The user asks about some scenario for a structure and somehow that language is converted to an input visualization of the "game state". Maybe some constraints to be solved for are encoded also somehow as part of that initial state.
Then when it's solved (by an agent trained through reinforcement learning that uses each dreamed game state as input?), the result "game state" is converted somehow back into language and combined with the original user query to provide an answer.
But if I understand properly, the biggest utility of this is that there is a network that understands how the world works, and that part of the network can be utilized for predicting useful actions or maybe answering questions etc. ?
Strangely, the paper doesn't seem to give much detail on the CS:GO example. Actually, the paper explicitly mentions it's limited to discrete control environments. Unless I'm missing something, the mouse input for Counter-Strike isn't discrete and wouldn't work.
I'm not sure why the title says it was trained on 2x4090 either as I can't see this on either the linked page or the paper. The paper mentions a GPU year of 4090 compute was used to train the Atari model.
The current batch of ML models looks a lot like filling in holes in a wall of text, drawings or movies: you erase a part of the wall and tell it to fix it. It fills in the hole using colors from the nearby walls in the kitchen and from similar walls, and we watch this in awe, thinking it must have figured out the design rules of the kitchen. However, what it's really done is interpolate the gaps with some sort of basis functions, trigonometric polynomials for example, and it used thousands of those. This solution wouldn't occur to us because our limited memory isn't enough for thousands of polynomials: we have to find a compact set of rules or give up entirely. So when these ML models predict the motion of planets, they approximate Newton's laws with a long series of basis functions.
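The wall-filling analogy can be made concrete: erase part of a signal, then least-squares-fit a large bank of trigonometric basis functions to what remains, and the gap gets filled plausibly without any "rules" being learned. The specific signal and basis sizes below are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.linspace(0, 1, 200, endpoint=False)
signal = np.sin(2 * np.pi * 3 * t) + 0.5 * np.cos(2 * np.pi * 7 * t)

known = np.ones_like(t, dtype=bool)
known[80:120] = False  # erase a chunk of the "wall"

# Basis: many sines/cosines, far more than the 2 true components.
K = 30
basis = np.column_stack(
    [np.sin(2 * np.pi * k * t) for k in range(1, K)] +
    [np.cos(2 * np.pi * k * t) for k in range(K)]
)

# Fit only on the surviving samples, then evaluate everywhere.
coef, *_ = np.linalg.lstsq(basis[known], signal[known], rcond=None)
filled = basis @ coef  # interpolates the erased gap

print(np.abs(filled[~known] - signal[~known]).max())
```

Here the gap is recovered almost exactly because the true signal happens to lie in the span of the basis; the comment's point is that with enough basis functions the fit looks like understanding even when it isn't.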
As someone who used to play CS 1.6 and CS:GO in my free time before the pandemic, I can tell this playable CS diffusion world model was trained by a noob player for research purposes.
After reading the comments I can assume that if you play outside of the scope it was trained on, the game loses its functionality.
Nevertheless, R&D for a good cause is something we all admire.
When my game starts to look like this, I know it is time to quit haha. Maybe a helpful tool in gaming addiction therapy? The morphing of the gun/skins and the environment (the sandbags), wow. Would like to play this and see what happens when you walk backwards, turn around quickly, or use 'noclip' :D
Could we imagine parts of game elements to become "targets" for models? For example hair and fur physics have been notoriously difficult to nail, but it should be easier to use AI to simulate some fake physics on top of the rendered frame, right? Is anyone working on that?
I earnestly think this is where all gaming will go in the next five years - it’s going to be so compelling that stuff already under development will likely see a shift to using diffusion models. As this is demonstrating, a sufficiently honed model can produce realtime graphics - and some of the demos floating around where people are running GTA San Andreas through non-realtime models hint as to where this will go.
I give it the same five years before there are games entirely indistinguishable from reality, and I don’t just mean graphical fidelity - there’s no reason that the same or another model couldn’t provide limitless physics - bust a hole through that wall, set fire to this refrigerator, whatever.
I think these models lack a world model, something with strong spatial reasoning and continuity expectations that animals have.
Of course that's probably learned too.
This is what a big tech company was doing in 2015.
The same stuff at industrial scale à la large LLMs would be absolutely mind blowing.
The rate of progress has been mind-blowing indeed.
We sure live in interesting times!
https://worldmodels.github.io/
Just want to point that out.
It seems someone already thought of that: https://ar5iv.labs.arxiv.org/html/2311.11221
For example https://github.com/NVlabs/CTG
Here's a demo from 2021 doing something like that: https://www.youtube.com/watch?v=3rYosbwXm1w
https://www.reddit.com/r/aivideo/comments/1fx6zdr/gta_iv_wit...