That felt so wrong, AND someone is cheating here. This felt really suspicious...

I got to the graffiti world and there were some stairs right next to me, so I started going up them. It felt like I was walking forward and the stairs were pushing under me until I just got stuck. So I turned to go back down, and halfway around everything morphed and I ended up back down at the ground level where I originally was. I was teleported. That's why I feel like something is cheating here. If we had mode collapse, I'm not sure how we should be able to completely recover our entire environment. Not unless the model is building mini worlds with boundaries. It was like the out-of-bounds teleportation you get in some games, but way more fever-dream-like. That's not what we want from these systems. We don't want to just build a giant, poorly compressed videogame; we want continuous generation. If you have mode collapse and recover, it should recover to somewhere new, not where you've been. At least, this is what makes me highly suspicious.
Yes, the thing that got me was that I went through the channels multiple times (multiple browser sessions). The channels are the same every time (the numbers don't align to any navigation, though; flip back and forth between two numbers and you'll just hit a random channel every time, so don't be fooled by that). Every object is in the same position and the layout is the same.
What makes this AI-generated, as opposed to just rendering a pre-generated 3D scene?

Like, it may seem impressive to have no glitches (often in AI-generated works you can turn around a full rotation and what's in front of you isn't what was there originally), but here it just acts like a fully modelled 3D scene rendered at low resolution. I can't even walk outside of certain bounds, which doesn't make sense if this really is generated on the fly.

This needs a lot of skepticism, and I'm surprised you're the first to comment on the lack of actual generation here. It's a series of static scenes rendered at low fidelity with limited bounds.
Hi! CEO of Odyssey here. Thanks for giving this a shot.
To clarify: this is a diffusion model trained on lots of video, that's learning realistic pixels and actions. This model takes in the prior video frame and a user action (e.g. move forward), with the model then generating a new video frame that resembles the intended action. This loop happens every ~40ms, so real-time.
The reason you're seeing similar worlds with this production model is that one of the greatest challenges of world models is maintaining coherence of video over long time periods, especially with diverse pixels (i.e. not a single game). So, to increase reliability for this research preview—meaning multiple minutes of coherent video—we post-trained this model on video from a smaller set of places with dense coverage. With this, we lose generality, but increase coherence.

We share a lot more about this in our blog post here (https://odyssey.world/introducing-interactive-video), and share outputs from a more generalized model.
> One of the biggest challenges is that world models require autoregressive modeling, predicting future state based on previous state. This means the generated outputs are fed back into the context of the model. In language, this is less of an issue due to its more bounded state space. But in world models—with a far higher-dimensional state—it can lead to instability, as the model drifts outside the support of its training distribution. This is particularly true of real-time models, which have less capacity to model complex latent dynamics.
> To improve autoregressive stability for this research preview, what we’re sharing today can be considered a narrow distribution model: it's pre-trained on video of the world, and post-trained on video from a smaller set of places with dense coverage. The tradeoff of this post-training is that we lose some generality, but gain more stable, long-running autoregressive generation.
> To broaden generalization, we’re already making fast progress on our next-generation world model. That model—shown in raw outputs below—is already demonstrating a richer range of pixels, dynamics, and actions, with noticeably stronger generalization.

Let me know any questions. Happy to go deeper!
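A minimal sketch of the loop described above, for the curious. Every name here is hypothetical (the denoiser object, its sample method, the action encoding); this only illustrates the frame-in, frame-out autoregressive idea, not Odyssey's actual code:

    import time
    import torch  # assumes PyTorch; the denoiser itself is a hypothetical stand-in

    def run_world_model(denoiser, encode_action, get_user_action, first_frame,
                        frame_time=0.040):
        # Autoregressive real-time loop: previous frame + user action -> next frame.
        frame = first_frame  # (C, H, W) image tensor
        while True:
            start = time.monotonic()
            # Encode the latest user input (e.g. "move_forward") into a conditioning vector.
            action = encode_action(get_user_action())
            # One fast (few-step) diffusion sample conditioned on the prior frame
            # and the action; the output is fed back in as context next iteration.
            with torch.no_grad():
                frame = denoiser.sample(context=frame, action=action)
            yield frame  # stream the new frame to the viewer
            # Pace the loop to ~40 ms per frame, i.e. roughly 25 fps.
            time.sleep(max(0.0, frame_time - (time.monotonic() - start)))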
It’s essentially this paper (https://diamond-wm.github.io/), but applied to video recordings of a bunch of different real-world locations instead of Counter-Strike maps. Each channel just changes the location.
> Not unless the model is building mini worlds with boundaries.
Right. I was never able to get very far from the starting point, and kept getting thrown back to the start. It looks like they generated a little spherical image, and they're able to extrapolate a bit from that. Try to go through a door or reach a distant building, and you don't get there.
I mean... https://news.ycombinator.com/item?id=44121671 informed you of exactly why this happens a whole hour before you posted this comment, and the creator is chatting with people in the comments. I get that you feel personally cheated, but I really don't think anyone was deliberately trying to cheat you. In light of that, your comment (and I only say this because it's the top comment on this post) is effectively a stereotypical "who needs Dropbox?" level of shallow dismissal.
It feels like interpolated Street View imagery. There is one scene with two people between cars in a parking lot; it is the only one I have found that has objects you would expect to change over time. When exploring the scene, those people sometimes disappear altogether and sometimes teleport around, as they would when exploring Street View panoramas. You can clearly tell when you are switching between photos taken a few seconds apart.
I call BS.

Note that it isn't being created from whole cloth; it is trained on videos of the places and then it generates the frames:

"To improve autoregressive stability for this research preview, what we’re sharing today can be considered a narrow distribution model: it's pre-trained on video of the world, and post-trained on video from a smaller set of places with dense coverage. The tradeoff of this post-training is that we lose some generality, but gain more stable, long-running autoregressive generation."

https://odyssey.world/introducing-interactive-video
Well, that felt like entering a dream on my phone. Fuzzy virtual environments generated by "a mind" based on its memory of real environments...
I wonder if it'd break our brains more if the environment changes as the viewpoint changes, but doesn't change back (e.g. if there's a horse, you pan left, pan back right, and the horse is now a tiger).
I kept expecting that to happen, but it apparently has some mechanism to persist context outside the user’s FOV.
In a way, that almost makes it more dreamlike, in that you have what feels like high local coherence (just enough not to immediately tip you off that it’s a dream) that de-coheres over time as you move through it.

Fascinatingly strange demo.
This is pretty much the same thing as those models that baked dust2 into a diffusion model (https://diamond-wm.github.io/) and then used the last few frames as context to continue generating; same failure modes and everything.
This is similar to the Minecraft version of this from a few months back [0], but it does seem to have a better time keeping a memory of what you've already seen, at least for a bit. Spinning in circles doesn't lose your position quite as easily, but I did find that exiting a room and then turning back around and re-entering leaves you in a totally different room than the one you exited.

[0] Minecraft with object impermanence (229 points, 146 comments) https://news.ycombinator.com/item?id=42762426
Only at first glance. It can easily render things that would be very hard to implement in an FPS engine.
What AI can dream up in milliseconds could take hundreds of human hours to encode using traditional tech (meshes, shaders, ray tracing, animation, logic scripts, etc.), and it still wouldn't look as natural and smooth as AI renderings — I refer to the latest developments in video synthesis like Google's Veo 3. Imagine it as a game engine running in real time.
I think an actual 3D engine with AI that can make new high quality 3D models and environments on the fly would be the pinnacle. And maybe even add new game and control mechanics on the fly.
It’s super cool. I keep thinking it kind of feels like dream logic. It looks amazing at first, but I'm not sure I'd want to stay in a world like that for too long. I actually like when things have limits, when the world pushes back a bit and gives you rules to work with.
Doesn't it have rules? I couldn't move past a certain point, and hitting a wall made you teleport. Maybe I was just rationalizing random events, though.

I LOVE dreamy AI content. That stuff where everything turned into dogs, for example.

As AI is maturing, we are slowly losing that in favor of boring realism and coherence.
I found an interesting glitch where you could never actually reach a parked car: as you moved forward, the car moved too. It looked a lot like traffic moving through Google Street View.
Yeah, I found the same thing. Cars would disappear in front of me, and then I reached the end of the world and it reset me. I'm not sure I believe this is AI rather than some crappy Street View interface.
Hi HN, I hope you enjoy our research preview of interactive video!
We think it's a glimpse of a totally new medium of entertainment, where models imagine compelling experiences in real-time and stream them to any screen.

Once you've taken the research preview for a whirl, you can learn a lot more about our technical work behind this here (https://odyssey.world/introducing-interactive-video).
This is amazing! I think AI will completely replace the way we currently create and consume media. A well-written story, paired with an amazing graphics-generation AI, can be both interactive and surprising every time you watch it again.
I'm unable to navigate anywhere. I'm on a laptop with a touchscreen and a trackpad. I clicked, double clicked, scrolled, and tried everything I could think of and the views just hovered around the same spot.
To me, this is evidence we're not in a simulation. Even with a gazillion H100's the model runs out of memory just (very roughly) simulating a 50'x50' space over just a few seconds.
In playing with this it was unclear to me how this differs from a pre-programmed 3D world with bitmapped walls. What is the AI adding that I wouldn't get otherwise?

i.e., as opposed to first generating a 3D env and then doing some sort of img2img on top of it?
If I had to choose one, I'd easily say maintaining video coherence over long periods of time. The typical failure case of world models attempting to generate diverse pixels (i.e. beyond a single video game) is that they degrade into a mush of incoherent pixels after 10-20 seconds of video.
We talk about this challenge in our blog post here (https://odyssey.world/introducing-interactive-video). There's specifics in there on how we improved coherence for this production model, and our work to improve this further with our next-gen model. I'm really proud of our work here!
> Compared to language, image, or video models, world models are still nascent—especially those that run in real-time. One of the biggest challenges is that world models require autoregressive modeling, predicting future state based on previous state. This means the generated outputs are fed back into the context of the model. In language, this is less of an issue due to its more bounded state space. But in world models—with a far higher-dimensional state—it can lead to instability, as the model drifts outside the support of its training distribution. This is particularly true of real-time models, which have less capacity to model complex latent dynamics. Improving this is an area of research we're deeply invested in.
In second place would absolutely be model optimization to hit real-time. That's a gnarly problem, where you're delicately balancing model intelligence, resolution, and frame-rate.
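For what it's worth, one published mitigation for that autoregressive drift (used, for example, by GameNGen, the Doom world model; whether Odyssey does exactly this is an assumption) is to noise-augment the context frames during training, so the model learns to tolerate the imperfect frames it will feed itself at inference time. A minimal sketch, with every name hypothetical:

    import torch

    def drift_robust_training_step(denoiser, optimizer, context_frames,
                                   action, target_frame, max_sigma=0.7):
        # Corrupt the conditioning frames with a random amount of Gaussian
        # noise, mimicking the degraded frames the model will see when it is
        # fed its own outputs during autoregressive rollout.
        sigma = torch.rand(()) * max_sigma
        noisy_context = context_frames + sigma * torch.randn_like(context_frames)
        # Standard denoising loss on the next frame, conditioned on the
        # corrupted context, the action, and the augmentation level itself.
        loss = denoiser.loss(noisy_context, action, target_frame, aug_level=sigma)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()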
I do not get the "interactive" part. I expect to be able to manipulate objects, or at least move them; you know, "interact" with the "video". Right now it is a cheap walking simulator, without narration or any plot. Lamp posts disappearing when you get near them should not be counted as interaction either.
Maybe you should take a somewhat different approach to interactive video: say, build a tech-review video for some gadget or device, where the viewer could interrupt the host by voice and ask questions, skip to some part, have something repeated in more detail, get a concept explained, or even get comparisons to other devices.
For sure, but consider it a "first draft" of what this type of generative AI can do.
The resolution is extremely low. The website doesn't specify, but I'd guess it's only 160x120. Such a low resolution is presumably necessary to render in real time at a reasonable frame rate. To hide the blurring a bit, they apply filters that add scan lines and other effects to make it look like an old TV.

That said, I'd be surprised if anybody could gather enough hardware to run this at a usable resolution, let alone something like 1080p. That's literally over 100x the pixels of 160x120.
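A quick sanity check on that arithmetic (the 160x120 figure is the commenter's guess, not a published spec):

    low_res = 160 * 120       # 19,200 pixels
    full_hd = 1920 * 1080     # 2,073,600 pixels
    print(full_hd / low_res)  # 108.0, i.e. a 1080p frame has ~108x the pixels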
I think this step towards a more immersive virtual reality can actually be dangerous. A lot of intellectual types might disagree, but I do think that creating such immersion is a dangerous thing, because it will reduce the value people place on the real world, and especially the natural world, making them even less likely to care if big corporations screw it up with biospheric degradation.

It also seems to have a high chance of leading to even more narcissism, because we are reducing our dependence on others to such a degree that we will care about others less and less, something that has already started happening with increasingly advanced interactive technology like AI.
> I think this step towards a more immersive virtual reality can actually be dangerous
I don't think it's a step toward that; I think this is literally trained using techniques to generate immersive virtual reality that already exist and take less compute, producing a more computationally expensive and less accurate AI version.

At least, that's what every other demo of a real-time interactive AI world model has been, and they aren't trumpeting any clear new distinction.
This is why we never see any alien life. When they reach a sufficient level of technology, they realize the virtual/mental universe is much more compelling and fun than the boring rule-bound physical one.