I think people might be missing what this enables. It can make plausible continuations of video, with realistic physics. What happens if this gets fast enough to work _in real time_?
Connect this to a robot that has a real time camera feed. Have it constantly generate potential future continuations of the feed that it's getting -- maybe more than one. You have an autonomous robot building a real time model of the world around it and predicting the future. Give it some error correction based on how well each prediction models the actual outcome and I think you're _really_ close to AGI.
You can probably already imagine different ways to wire the output to text generation and controlling its own motions, etc, and predicting outcomes based on actions it, itself could plausibly take, and choosing the best one.
It doesn't actually have to generate realistic imagery or imagery that doesn't have any mistakes or imagery that's high definition to be used in that way. How realistic is our own imagination of the world?
Edit: I'm going to add a specific case. Imagine a house cleaning robot. It starts with an image of your living room. Then it creates an image of your living room after it's been cleaned. Then it interpolates a video _imagining itself cleaning the room_, then acts as much as it can to mimic what's in the video, then generates a new continuation, then acts, and so on. Imagine doing that several times a second, if necessary.
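The imagine, act, re-imagine loop in the edit above resembles what control people call model-predictive control: roll candidate futures forward in a model, take only the first step of the best one, then plan again. Here is a minimal sketch under heavy assumptions. The one-dimensional "room", the `world_model` function and the dirt-counting cost are all made up for illustration and have nothing to do with Sora itself; the model is also perfect here, which a learned video model would not be.

```python
from itertools import product

# Hypothetical toy world: a robot on a line of cells, some of them dirty.
# State = (robot position, frozenset of dirty cells).
ACTIONS = (-1, 0, +1)   # step left, clean the current cell, step right
N_CELLS = 6

def world_model(state, action):
    """Stand-in for the 'imagination': predict the next state for an action."""
    pos, dirty = state
    if action == 0:
        return pos, dirty - {pos}
    return min(max(pos + action, 0), N_CELLS - 1), dirty

def imagined_cost(state, plan):
    """Roll a plan forward inside the model; cost = dirty cells summed over time."""
    cost = 0
    for action in plan:
        state = world_model(state, action)
        cost += len(state[1])
    return cost

def plan_next_action(state, horizon=5):
    """Imagine every short future, keep the cheapest, return only its first step."""
    best = min(product(ACTIONS, repeat=horizon),
               key=lambda plan: imagined_cost(state, plan))
    return best[0]

def clean_room(state, max_steps=50):
    steps = 0
    while state[1] and steps < max_steps:
        # On a real robot the next state would come from the camera, not the
        # model; in this toy the two are the same thing.
        state = world_model(state, plan_next_action(state))
        steps += 1
    return state, steps

print(clean_room((0, frozenset({1, 3, 5}))))
```

Exhaustive enumeration only works because the toy has three actions and a horizon of five; anything bigger would need to sample a handful of candidate futures instead, which is closer to the "maybe more than one" continuation idea.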
You're talking about an agent with a world model used for planning. Actually generating realistic images is not really needed as the world model operates in its own compressed abstraction.
> Connect this to a robot that has a real time camera feed. Have it constantly generate potential future continuations of the feed that it's getting -- maybe more than one. You have an autonomous robot building a real time model of the world around it and predicting the future. Give it some error correction based on how well each prediction models the actual outcome and I think you're _really_ close to AGI.
In theory, yes. The problem is we've had AGI many times before, in theory. For example, Q-learning: feed the state of any game or system through a neural network, have it predict possible future rewards, iteratively improve the accuracy of the reward predictions, and boom, eventually you arrive at the optimal behavior for any system. We've known this since... the 70's maybe? I don't know how far Q-learning goes back.
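For reference, Q-learning is usually credited to Watkins' 1989 thesis. The update described above can be sketched in a few lines. This is the tabular version rather than the neural-network one the comment describes, and the five-state corridor, the reward and the hyperparameters are assumptions chosen only to keep the example tiny.

```python
import random

N_STATES, GOAL = 5, 4   # states 0..4 on a line; reaching state 4 ends the episode
ACTIONS = (-1, +1)      # step left, step right

def step(state, action):
    """Toy deterministic environment: reward 1.0 only on reaching GOAL."""
    nxt = min(max(state + action, 0), GOAL)
    return nxt, (1.0 if nxt == GOAL else 0.0)

def q_learning(episodes=500, alpha=0.5, gamma=0.9, seed=0):
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
    for _ in range(episodes):
        state = 0
        while state != GOAL:
            # Purely random exploration; Q-learning is off-policy, so it can
            # still learn the greedy policy from random behaviour.
            action = rng.choice(ACTIONS)
            nxt, reward = step(state, action)
            future = 0.0 if nxt == GOAL else max(q[(nxt, a)] for a in ACTIONS)
            # Nudge the estimate toward reward + discounted best predicted value.
            q[(state, action)] += alpha * (reward + gamma * future - q[(state, action)])
            state = nxt
    return q

def greedy_policy(q):
    """Best action per non-terminal state according to the learned table."""
    return [max(ACTIONS, key=lambda a: q[(s, a)]) for s in range(GOAL)]

print(greedy_policy(q_learning()))
```

On a problem this small the table converges quickly, which is the "in theory" part. The disappointment described in the next paragraph tends to show up once the state space is too large to visit exhaustively.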
I like to do experiments with reinforcement learning and it's always exciting to think "once I turn this thing on, it's going to work well and find lots of neat solutions to the problem", and the thing is, it's true, that might happen, but usually it doesn't. Usually I see some signs of learning, but it fails to come up with anything spectacular.
I keep watching for a strong AI in a video game like Civilization as a sign that AI can solve problems in a highly complex system while also being practical enough that game creators are able to implement it in a practical way. Yes, maybe, maybe, a team with experts could solve Civilization as a research project, but that's far from being practical. Do you think we'll be able to show an AI a video of people playing Civilization and have the video predict the best moves before the AI in the game is able to predict the best moves?
What I find interesting is that b/c we have so much video data, we have this thing that can project the future in 2d pixel space.
Projecting into the future in 3d world space is actually what the endgame for robotics is and I imagine depending on how complex that 3d world model is, a working model for projecting into 3d space could be waaaaaay smaller.
It's just that the equivalent data is not as easily available on the internet :)
>> Connect this to a robot that has a real time camera feed. Have it constantly generate potential future continuations of the feed that it's getting -- maybe more than one. You have an autonomous robot building a real time model of the world around it and predicting the future. Give it some error correction based on how well each prediction models the actual outcome and I think you're _really_ close to AGI.
As another comment points out that's Yann LeCun's idea of "Objective-Driven AI" introduced in [1] though not named that in the paper (LeCun has named it that way in talks and slides). LeCun has also said that this won't be achieved with generative models. So, either 1 out of 2 right, or both wrong, one way or another.
For me, I've been in AI long enough to remember many such breakthroughs that would lead to AGI before - from DeepBlue (actually) to CNNs, to Deep RL, to LLMs just now, etc. Either all those were not the breakthroughs people thought at the time, or it takes many more than an engineering breakthrough to get to AGI, otherwise it's hard to explain why the field keeps losing its mind about the Next Big Thing and then forgetting about it a few years later, when the Next Next Big Thing comes around.
But, enough with my cynicism. You think that idea can work? Try it out. In a simplified environment. Take some stupid grid world, a simplification of a text-based game like Nethack [2] and try to implement your idea, in-vitro, as it were. See how well it works. You could write a paper about it.
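As a starting point for that kind of in-vitro experiment, here is a minimal sketch of the "predict, compare with what actually happened, correct" part on a grid world. Everything in it is hypothetical: the 4x4 grid, the wall positions and the `LearnedModel` class are made up for illustration, and the "model" is a lookup table rather than anything learned from video.

```python
from itertools import product

SIZE = 4
MOVES = {"up": (0, -1), "down": (0, 1), "left": (-1, 0), "right": (1, 0)}
WALLS = {(1, 1), (2, 1)}   # hypothetical obstacles

def real_step(pos, move):
    """Ground truth the agent cannot inspect: walls and edges block movement."""
    dx, dy = MOVES[move]
    nxt = (pos[0] + dx, pos[1] + dy)
    inside = 0 <= nxt[0] < SIZE and 0 <= nxt[1] < SIZE
    return nxt if inside and nxt not in WALLS else pos

class LearnedModel:
    """Predicts next positions; starts naive and is corrected by observed outcomes."""

    def __init__(self):
        self.seen = {}

    def predict(self, pos, move):
        # Naive prior: assume every move succeeds (no idea about walls or edges).
        dx, dy = MOVES[move]
        return self.seen.get((pos, move), (pos[0] + dx, pos[1] + dy))

    def correct(self, pos, move, actual):
        self.seen[(pos, move)] = actual

def sweep(model):
    """Try every move from every free cell once; count the wrong predictions."""
    errors = 0
    for x, y in product(range(SIZE), repeat=2):
        if (x, y) in WALLS:
            continue
        for move in MOVES:
            actual = real_step((x, y), move)
            if model.predict((x, y), move) != actual:
                errors += 1
                model.correct((x, y), move, actual)
    return errors

model = LearnedModel()
print(sweep(model), sweep(model))  # errors on the first pass, then on the second
```

The table memorises rather than generalises, so it says nothing about whether a generative video model would cope. It only pins down what "error correction based on prediction" means in the simplest possible setting.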
I totally agree that a system like Sora is needed. By itself, it’s insufficient. With a multimodal model that can reason properly, then we get AGI or rather ASI (artificial super intelligence) due to many advantages over humans such as context length, access to additional sensory modalities (infrared, electroreception, etc), much broader expertise, huge bandwidth, etc.
future successor to Sora + likely successor to GPT-4 = ASI
Adding to this: Sora was most likely trained on video that's more like what you'd normally see on YouTube or in a clip art or media licensing company collection. Basically, video designed to look good as a part of a film or similar production.
So right now, Sora is predicting "Hollywood style" content, with cuts, camera motions, etc... all much like what you'd expect to see in an edited film.
Nothing stops someone (including OpenAI) from training the same architecture with "real world captures".
Imagine telling a bunch of warehouse workers that for "safety" they all need to wear a GoPro-like action camera on their helmets that records everything inside the work area. Run that in a bunch of warehouses with varying sizes, content, and forklifts, and then pump all of that through this architecture to train it. Include the instructions given to the staff from the ERP system as well as the transcribed audio as the text prompt.
Ta-da.
You have yourself an AI that can control a robot using the same action camera as its vision input. It will be able to follow instructions from the ERP, listen to spoken instructions, and even respond with a natural voice. It'll even be able to handle scenarios such as spills, breaks, or other accidents... just like the humans in its training data did. This is basically what vehicle auto-pilots do, but on steroids.
Sure, the computer power required for this is outrageously expensive right now, but give it ten to twenty years and... no more manual labour.
A 3d model with object permanency is definitely a step in the right direction of something or other but for clarity let us dial back down the level of graphical detail.
A Pacman bot is not AGI. Might get it to eat all the dots correctly whereas before if something scrolled off the screen it'd forget about it and glitch out - but you didn't fan any flames of consciousness into existence as of yet.
The flip side of video or image gen is always video or image identification. If video gets really good then an AI can have quite an accurate visual view into the world in real time.
Imagine where you want to be (eg, “I scored a goal!”) from where you are now, visualize how you’ll get there (eg, a trick and then a shot), then do that.
Thanks for adding the specific case. I think with testing these sorts of limited-domain applications make sense.
It'll be much harder for more open-ended world problems where the physics encountered may be rare enough in the dataset that the simulation breaks unexpectedly. For example a glass smashing into the floor. The model doesn't simulate that causally afaik.
FWIW, you've basically described at a high level exactly what autonomous driving systems have been doing for several years. I don't think anyone would say that Waymo's cars are really close to AGI.
This comment is brilliant. Thank you. I’m so excited now to build a bot that uses predictive video. I wonder what the most simple prototype would be? Surely one that has a simple validation loop that can say hey, this predicted video became true. Perhaps a 2D infinite scrolling video game?
I've also noticed on some of the featured videos that there are some perspective/parallax errors. The human subjects in some are either oversized compared to background people, or they end up on horizontal planes that don't line up properly. It's actually a bit vertigo-inducing! It is still very remarkable
The hyper realistic and plausible movement of the glass breaking makes this bizarrely fascinating. And it doesn’t give me the feeling of disgust the motion in the more primitive AI models did
> Other interactions, like eating food, do not always yield correct changes in object state
So this is why they haven't shown Will Smith eating spaghetti.
> These capabilities suggest that continued scaling of video models is a promising path towards the development of highly-capable simulators of the physical and digital world
This is exciting for robotics. But an even closer application would be filling holes in gaussian splatting scenes. If you want to make a 3D walkthrough of a space you need to take hundreds to thousands of photos with seamless coverage of every possible angle, and you're still guaranteed to miss some. Seems like a model this capable could easily produce plausible reconstructions of hidden corners or close up detail or other things that would just be holes or blurry parts in a standard reconstruction. You might only need five or ten regular photos of a place to get a completely seamless and realistic 3D scene that you could explore from any angle. You could also do things like subtract people or other unwanted objects from the scene. Such an extrapolated reconstruction might not be completely faithful to reality in every detail, but I think this could enable lots of applications regardless.
Do note that "reconstruction" is not the right word, the proper characterisation of that sort of imputation is "artist impression": good for situations where the precise details don't matter. Though of course if the details don't matter maybe blurry is fine.
AlphaGo and AlphaZero were able to achieve superhuman performance due to the availability of perfect simulators for the game of Go. There is no such simulator for the real world we live in (although pure LLMs sort of learn a rough, abstract representation of the world as perceived by humans.) Sora is an attempt to build such a simulator using deep learning.
“Our results suggest that scaling video generation models is a promising path towards building general purpose simulators of the physical world.”
General, superhuman robotic capabilities on the software side can be achieved once such a simulator is good enough. (Whether that can be achieved with this approach is still not certain.)
Why superhuman? Larger context length than our working memory is an obvious one, but there will likely be other advantages such as using alternative sensory modalities and more granular simulation of details unfamiliar to most humans.
Really interesting how this goes against my intuition. I would have imagined that it's infinitely easier to analyze a camera stream of the real world, then generate a polygonal representation of what you see (like you would do for a videogame) and then make AI decisions for that geometry. Instead the way that AI is going they rather skip it all and work directly on pixel data. Understanding of 3d geometry, perspective and physics is expected to evolve naturally from the training data.
There is a perfect simulator of the real world available. It can be recorded with a camera! Once the researchers have a bit of time to get their bearings and figure out how to train an order of magnitude faster we'll get there.
I think it's Yann LeCun who stated a few times that video was the better way to train large models as it's more information dense.
The results really are impressive. Being able to generate such high quality videos, to extend videos in the past and in the future shows how much the model "understands" the real world, objects interaction, 3D composition, etc...
Although image generation already requires the model to know a lot about the world, I think there's really a huge gap with video generation where the model needs to "know" 3D, object movements and interactions.
Watching an entirely generated video of someone painting is crazy.
I can't wait to play with this but I can't even imagine how expensive it must be. They're training in full resolution and can generate up to a minute of video.
Seeing how bad video generation was, I expected it would take a few more years to get to this but it seems like this is another case of "Add data & compute"(TM) where transformers prove once again they'll learn everything and be great at it
I know the main post has been getting a lot of reaction, but this page absolutely blew me away. The results are striking.
The robot examples are very underwhelming, but the people and background people are all very well done, and at a level much better than most static image diffusion models produce. Generating the same people as they interact with objects is also not something I expected a model like this to do well so soon.
I find it wild that this model does not have explicit 3D prior, yet learns to generate videos with such 3D consistency, you can directly train a 3D representation (NeRF-like) from those videos: https://twitter.com/BenMildenhall/status/1758224827788468722
I was similarly astonished at this adaptation of stable diffusion to make HDR spherical environment maps from existing images- https://diffusionlight.github.io/
The crazy thing is that they do it by prompting the model to in paint a chrome sphere into the middle of the image to reflect what is behind the camera! The model can interpret the context and dream up what is plausibly in the whole environment.
You aren't looking carefully enough. I find so many inconsistencies in these examples. Perspectives that are completely wrong when the camera rotates. Windows that shift perspective, patios that are suddenly deep/shallow. Shadows that appear/disappear as the camera shifts. In other examples; paths, objects, people suddenly appearing or disappearing out of nowhere. A stone turning into a person. A horse that suddenly has a second head, then becomes a separate horse with only two legs.
It is impressive at a glance, but if you pay attention, it is more like dreaming than realism (images conjured out of other images, without attention to long term temporal, spatial, and causal consistency). I'm hardly more impressed than by Google's DeepDream, which is 10 years old.
That's an interesting idea. Analogous to how LLMs are simply "text predictors" but end up having to learn a model of language and the world to correctly predict cohesive text, it makes sense that "video predictors" also have to learn a model of the world that makes sense. I wonder how many orders of magnitude further they have to evolve to be similarly useful.
If they would allow this (maybe a premium+ model) they could soon destroy the whole porn industry. not the websites, but the (often abused) sex workers. Everyone could describe that fetish they are into and get it visualized instantly without the need of physical human suffering to produce these videos.
I know it's a delicate topic people (especially in the US) don't want to speak about at all, but damn, this is a giant market and could do humanity good if done well.
There are thousands of porn consumers with destroyed reward circuitry per every porn actor, of which few are mistreated and the majority are compensated very well.
Producing a neverending supply of wireheading-like addictive stimuli is the farthest possible thing from a good for humanity.
Want to do good in this area - work on ways to limit consumption.
Video will be especially important for language models to grasp physical actions that are instinctive and obvious to humans but not explicitly detailed in text or video captions. I mentioned this in 2022:
While it seems plausible that eventually you could build a game around one of these models, the lack of an underlying state representation that you can permute in a precise way is a pretty strong barrier to anything resembling real user-and-system interaction. Even expressing pong through text prompts in a way that would produce desirable results in this is a tough challenge.
I could imagine a text adventure game with a 'visual' component perhaps working if you got the model to maintain enough consistency in spaces and character appearances.
I was wondering how feasible it would be to make a Minecraft agent that had a running feed of the past few seconds, continued it off w/ SORA, fed the continuation into a (relatively) simple policy translator that just pulled out what the video showed as player inputs, and then inputted that.
Presumably, this would work for non-minecraft applications, but Minecraft has a really standardized interface layer
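The "policy translator" step, working out which input would produce the change seen between two frames, is sometimes called an inverse dynamics model. A toy sketch of just that step follows. It assumes a hypothetical tile world where frames are tiny grids with a single player pixel and the keys are W/A/S/D; the hard-coded frames stand in for whatever a video model would predict, since Sora has no public interface to call here.

```python
def render(pos, size=5):
    """Hypothetical 'frame': a size x size grid with a 1 where the player stands."""
    return tuple(
        tuple(1 if (x, y) == pos else 0 for x in range(size)) for y in range(size)
    )

def find_player(frame):
    """Locate the player pixel in a frame."""
    for y, row in enumerate(frame):
        for x, value in enumerate(row):
            if value:
                return x, y
    raise ValueError("no player in frame")

# Displacement between consecutive frames -> the key that would cause it.
KEY_FOR_DELTA = {(0, -1): "W", (-1, 0): "A", (0, 1): "S", (1, 0): "D"}

def infer_action(before, after):
    """Inverse dynamics on the toy world: which key explains the frame change?"""
    (x0, y0), (x1, y1) = find_player(before), find_player(after)
    return KEY_FOR_DELTA.get((x1 - x0, y1 - y0))   # None = stood still or unclear

def extract_inputs(predicted_frames):
    """Turn an imagined clip into the sequence of inputs that would replay it."""
    return [infer_action(a, b) for a, b in zip(predicted_frames, predicted_frames[1:])]

# Stand-in for a predicted continuation: up, right, wait, down.
clip = [render(p) for p in [(2, 2), (2, 1), (3, 1), (3, 1), (3, 2)]]
print(extract_inputs(clip))
```

With real Minecraft footage the lookup table would have to be a learned model over pixels, and a predicted frame may show a change that no single input can produce, which is where the approach would likely get difficult.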
This is the second Sora announcement I've seen. Am I missing how I can play with it? The examples in the papers are all well and good but I want to get my hands on it and try it.
I don't know if there is research into this, didn't see it mentioned here, but this is the most probable path to something like AI consciousness and AGI. Of course it's highly speculative but video to world simulation is how the brain evolved and probably what is needed to have a robot behave like a living being. It would just do this in reverse, video input to inner world model, and use that for reasoning about the world. Extremely fascinating, and also scary this is happening so quickly.
I think not in the short term- I'm guessing the next step will be to use traditional tools to make a "draft" of a desired video, then "finish it off" with this kind of deep learning tech.
So this tech will increase interest in existing 3d tools in the short term
People working in vfx are incredibly gloomy today, they see the writing on the wall now, whether it's 1 or 5 years. There will still be a demand for human-created stuff but many of the jobs in advertising and stock footage will disappear.
Arguably you should go long since once they integrate this into their products (as Adobe is doing) they have the distribution in place to monetise it, industry knowledge to combine it with existing workflows, etc.
As someone who's played probably too many hours of minecraft, these videos are nauseating. The way that all of the individual pieces exist, but have no consistency is terrifying. Random FoV changes, switching apparent texture packs, raytracing on or off, it's all still switching back and forth from moment to moment.
These videos honestly give me less confidence in the approach, simply because I don't know that the model will be able to distinguish these world model parameters as "unchanging".
The video with the two MTBs going downhill: it seems to me that the long left turn that begins a few seconds into the video is way too long. It's easy to misjudge that kind of thing (try to draw a road race track by looking at a single lap of it) but it could end up below the point where it started, or too close to it to be physically realistic. I was expecting to see a right turn at any moment but it kept going left. It could be another consequence of the lack of real knowledge about the world, similar to the glass shattering example at the end of the article.
> We empirically find that training on videos at their native aspect ratios improves composition and framing. We compare Sora against a version of our model that crops all training videos to be square, which is common practice when training generative models. The model trained on square crops (left) sometimes generates videos where the subject is only partially in view. In comparison, videos from Sora (right) have improved framing.
Every cv preprocessing pipeline is in shambles now.
The improvement to temporal consistency given that the length of these generated videos is 3 to 4 times longer than anything else on the market (runway, pika, etc) is truly remarkable.
Yeah I saw that too, it does it in one other place, the feet are also sort of gliding around. If you look at the people in the background a lot of them are doing the same, and there are other temporal-mechanical inconsistencies like joints inverting half way through a movement, I guess due to it operating in 2D so when things are foreshortened from the camera angle they have the opportunity to suddenly jump into the wrong position, like twitchy inverse kinematics.
Everything also has a sort of mushy feel about it, like nothing is anchored down, everything is swimming. Impressive all the same, but maybe these methods need to be paired with some old fashioned 3d rendering to serve as physically based guideline.
While the Sora videos are impressive, are these really world simulators? While some notion of real-world physics probably exists somewhere within the model, doesn’t all the completely artificial training data corrupt it?
Reasoning, logic, formal systems, and physics exist in a seemingly completely different, mathematical space than pure video.
This is just a contrived, interesting viewpoint of the technology, right?
> Reasoning, logic, formal systems, and physics exist in a seemingly completely different, mathematical space than pure video.
That's not true, AI systems in general have pretty strong mathematical proofs going back decades on what they can theoretically do, the problem is compute and general feasibility. AIXItl in theory would be able to learn reasoning, logic, formal systems, physics, human emotions, and a great deal of everything else just from watching videos. They would have to be videos of varied and useful things, but even if they were not, you'd at least get basic reasoning, logic, and physics.
This is a totally silly thought, but I still want to get it out there.
> Other interactions, like eating food, do not always yield correct changes in object state
Can this be because we just don't shoot a lot of people eating? I think it is general advice to not show people eating on camera for various reasons. I wonder if we know if that kind of topic bias exists in the dataset.
What makes OpenAI so far ahead of all of these other research firms (or even startups like Pika, Runway, etc.)? I feel like I see so many examples of fields where progress is being made all across and OpenAI suddenly swoops in with an insane breakthrough lightyears ahead of everyone else.
Is this generating videos as streaming content, e.g. like an mp4 video? As far as I can see, it is doing that. Is it possible for AI to actually produce the 3d models?
What kind of compute resources are required to produce the 3d models?
This is some incredible and fascinating work! The applications seem endless.
1. High quality video or image from text
2. Taking in any content as input and generating forwards/backwards in time
3. Style transformation
4. Digital World simulation!
The current development of AI seems like speed run of Crystal Society in terms of their interaction with the world. The only thing missing is the Inner Purpose.
> Our results suggest that scaling video generation models is a promising path towards building general purpose simulators of the physical world.
so they're gonna include the never-before-observed-but-predicted Unruh effect, as well? and other quantum theory? cool..
> For example, it does not accurately model the physics of many basic interactions, like glass shattering.
... oh
Isn't all of the training predicated on visible, gathered data - rather than theory? if so, I don't think it's right to call these things simulators of the physical world if they don't include physical theory.
DFT at least has some roots in theory.
I'm guessing that either the shots it's trained on have a propensity to be slo-mo (like puppies playing in a field) or making it slow-motion makes unnatural movement a lot less obvious.
Ugh, AI generated images everywhere is already annoying enough. Now we're gonna have these factitious videos clogging up everything, and I'll have to explain to my old neighbor again and again that Biden did in fact not eat a fetus.
100%. It's actually gotten even more dull once they started fixing the fingers. But it's too much; you start to realise that it's just so uninspired. Maybe what this will ultimately do is allow good writers to bring their ideas to life (I hope).
People are obviously already pointing out the errors in various physical interactions shown in the demo videos, including the research team themselves, and I think the plausibility of the generated videos will likely improve as they work on the model more. However, I think the major reason this generation -> simulation leap might be a harder leap than they think is actually a plausibility/accuracy distinction. Generative models are general and versatile compared to predictive models, but they're intrinsically learning an objective that assesses their extrapolations on spatial or sequential (or in the case of video, both) plausibility, which has a lot more degrees of freedom than accuracy. In other words, the ability to create reasonable-enough hypotheses for what the next frame or the next pixel over could be might end up not being enough. The optimistic scenario is that it's possible to get to a simulation by narrowing this hypothesis-space enough to accurately model reality. In other words, it's possible that this is just something that could fall out of the plausibility being continuously improved, like the subset of plausible hypotheses shrinks as the model gets better, and eventually we get a reality-predictor, but I think there are good reasons to think that's far from guaranteed. I'd be curious to see what happens if you restrict training data to unaltered camera footage rather than allowing anything fictitious, but the least optimistic possibility is that this kind of capability is necessary but not sufficient for adequate prediction (or slightly more optimistically, can only do so with amounts of resolution that are currently infeasible, or something).
One reason the less optimistic scenarios seem likely is that the kinds of extrapolation errors this model makes are of similar character to those of LLMs: extrapolation follows a gradient of smooth apparent transitions rather than some underlying logic about the objects portrayed, and sometimes seems to just sort of ignore situations that are far enough outside of what it's seen rather than reconcile them. For example, the tidal wave/historical hall example is a scenario unlikely to have been in the training data. Sure, there's the funny bit at the end where the surfer appears to levitate in the air, but there's a much larger issue with how these two contrasting scenes interact, or rather fail to. What we see looks a lot more like a scene of surfing superimposed via photoshop or something on a still image of the hall, as there's no evidence of the water interacting with the seats or walls in the hall at all. The model will just roll with whatever you tell it to do as best it can, but it's not doing something like modeling "what would happen if" that implausible scenario played out, and even doing it poorly would be a better sign for this doing something like "simulating" the described scenario. Instead, we have impressive results for prompts that likely strongly correspond to scenes the model may have seen, and evidence of a lack of composition in cases where a particular composition is unlikely to have been seen and needs some underlying understanding of how it "would" work that is visible to us.
As if what you consume normally is actual real life. With blue screens, VFX, etc. you are already watching knockoffs of real life, and the shitty will become indistinguishable from reality before long.
empath-nirvana|2 years ago
Connect this to a robot that has a real time camera feed. Have it constantly generate potential future continuations of the feed that it's getting -- maybe more than one. You have an autonomous robot building a real time model of the world around it and predicting the future. Give it some error correction based on how well each prediction models the actual outcome and I think you're _really_ close to AGI.
You can probably already imagine different ways to wire the output to text generation and controlling its own motions, etc, and predicting outcomes based on actions it, itself could plausibly take, and choosing the best one.
It doesn't actually have to generate realistic imagery or imagery that doesn't have any mistakes or imagery that's high definition to be used in that way. How realistic is our own imagination of the world?
Edit: I'm going to add a specific case. Imagine a house cleaning robot. It starts with an image of your living room. Then it creates an image of your living room after it's been cleaned. Then it interpolates a video _imagining itself cleaning the room_, then acts as much as it can to mimic what's in the video, then generates a new continuation, then acts, and so on. Imagine doing that several times a second, if necessary.
margorczynski|2 years ago
Check out V-Jepa for such a system: https://ai.meta.com/blog/v-jepa-yann-lecun-ai-model-video-jo...
Buttons840|2 years ago
In theory, yes. The problem is we've had AGI many times before, in theory. For example, Q-learning: feed the state of any game or system through a neural network, have it predict possible future rewards, iteratively improve the accuracy of the reward predictions, and boom, eventually you arrive at the optimal behavior for any system. We've known this since... the 70's maybe? I don't know how far Q-learning goes back.
I like to do experiments with reinforcement learning and it's always exciting to think "once I turn this thing on, it's going to work well and find lots of neat solutions to the problem", and the thing is, it's true, that might happen, but usually it doesn't. Usually I see some signs of learning, but it fails to come up with anything spectacular.
I keep watching for a strong AI in a video game like Civilization as a sign that AI can solve problems in a highly complex system while also being practical enough that game creators are able to implement it in a practical way. Yes, maybe, maybe, a team with experts could solve Civilization as a research project, but that's far from being practical. Do you think we'll be able to show an AI a video of people playing Civilization and have the video predict the best moves before the AI in the game is able to predict the best moves?
LarsDu88|2 years ago
Projecting into the future in 3d world space is actually what the endgame for robotics is and I imagine depending on how complex that 3d world model is, a working model for projecting into 3d space could be waaaaaay smaller.
It's just that the equivalent data is not as easily available on the internet :)
YeGoblynQueenne|2 years ago
As another comment points out that's Yann LeCun's idea of "Objective-Driven AI" introduced in [1] though not named that in the paper (LeCun has named it that way in talks and slides). LeCun has also said that this won't be achieved with generative models. So, either 1 out of 2 right, or both wrong, one way or another.
For me, I've been in AI long enough to remember many such breakthroughs that would lead to AGI before - from DeepBlue (actually) to CNNs, to Deep RL, to LLMs just now, etc. Either all those were not the breakthroughs people thought at the time, or it takes much more than an engineering breakthrough to get to AGI. Otherwise it's hard to explain why the field keeps losing its mind about the Next Big Thing and then forgetting about it a few years later, when the Next Next Big Thing comes around.
But, enough with my cynicism. You think that idea can work? Try it out. In a simplified environment. Take some stupid grid world, a simplification of a text-based game like Nethack [2] and try to implement your idea, in-vitro, as it were. See how well it works. You could write a paper about it.
____________________
[1] https://openreview.net/pdf?id=BZ5a1r-kVsf
[2] Obviously don't start with Nethack itself because that's damn hard for "AI".
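For anyone who wants to try the grid-world experiment suggested above, here is one possible starting point. It is a sketch, not anyone's published method: the 5x5 map, the function names, and the choice of breadth-first search over an optimistic, per-transition world model are all assumptions made for illustration. The agent imagines futures with its model, takes the first step of the best imagined path, and corrects the model whenever the real outcome differs from the prediction.

```python
from collections import deque

# Hypothetical 5x5 grid world (illustrative only). '#' is a wall the agent
# does not know about in advance, 'S' is the start, 'G' is the goal.
GRID = [
    ".....",
    "..#..",
    "..#..",
    "..#..",
    "S.#.G",
]
MOVES = {"U": (-1, 0), "D": (1, 0), "L": (0, -1), "R": (0, 1)}
SIZE = len(GRID)


def find(char):
    for r, row in enumerate(GRID):
        if char in row:
            return (r, row.index(char))
    raise ValueError(char)


def real_step(pos, move):
    """The actual environment: walls and grid edges block movement."""
    r, c = pos[0] + MOVES[move][0], pos[1] + MOVES[move][1]
    if 0 <= r < SIZE and 0 <= c < SIZE and GRID[r][c] != "#":
        return (r, c)
    return pos


def predict(model, pos, move):
    """The agent's world model: the remembered outcome if this transition
    has been tried before, otherwise an optimistic guess that the move
    succeeds (as long as it stays inside the grid)."""
    if (pos, move) in model:
        return model[(pos, move)]
    r, c = pos[0] + MOVES[move][0], pos[1] + MOVES[move][1]
    return (r, c) if 0 <= r < SIZE and 0 <= c < SIZE else pos


def imagine_plan(model, start, goal):
    """Breadth-first search through predicted futures. Returns the first
    move of a shortest imagined path, or None if no path is imagined."""
    frontier, seen = deque([(start, None)]), {start}
    while frontier:
        pos, first = frontier.popleft()
        if pos == goal:
            return first
        for move in MOVES:
            nxt = predict(model, pos, move)
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, first or move))
    return None


def run(max_steps=100):
    """Plan, act, compare prediction with reality, correct the model.
    Returns (steps taken or None if the goal was not reached, surprises)."""
    model, pos, goal = {}, find("S"), find("G")
    surprises = 0
    for t in range(max_steps):
        if pos == goal:
            return t, surprises
        move = imagine_plan(model, pos, goal)
        if move is None:
            break
        expected = predict(model, pos, move)
        actual = real_step(pos, move)
        if expected != actual:
            surprises += 1               # prediction error
        model[(pos, move)] = actual      # error correction
        pos = actual
    return None, surprises


if __name__ == "__main__":
    print(run())
```

Here the "imagination" is exact search over a tiny discrete model, which is the easy part. Replacing `predict` with a learned, noisy, high-dimensional predictor (such as a video model) is where the experiment becomes interesting, and where it would show whether the idea holds up.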
nopinsight|2 years ago
future successor to Sora + likely successor to GPT-4 = ASI
See my other comment here: https://news.ycombinator.com/item?id=39391971
jiggawatts|2 years ago
So right now, Sora is predicting "Hollywood style" content, with cuts, camera motions, etc... all much like what you'd expect to see in an edited film.
Nothing stops someone (including OpenAI) from training the same architecture with "real world captures".
Imagine telling a bunch of warehouse workers that for "safety" they all need to wear a GoPro-like action camera on their helmets that record everything inside the work area. Run that in a bunch of warehouses with varying sizes, content, and forklifts, and then pump all of that through this architecture to train it. Include the instructions given to the staff from the ERP system as well as the transcribed audio as the text prompt.
Ta-da.
You have yourself an AI that can control a robot using the same action camera as its vision input. It will be able to follow instructions from the ERP, listen to spoken instructions, and even respond with a natural voice. It'll even be able to handle scenarios such as spills, breaks, or other accidents... just like the humans in its training data did. This is basically what vehicle auto-pilots do, but on steroids.
Sure, the computer power required for this is outrageously expensive right now, but give it ten to twenty years and... no more manual labour.
verticalscaler|2 years ago
A Pacman bot is not AGI. You might get it to eat all the dots correctly, whereas before, if something scrolled off the screen, it'd forget about it and glitch out - but you haven't fanned any flames of consciousness into existence as of yet.
coffeebeqn|2 years ago
zmgsabst|2 years ago
Imagine where you want to be (eg, “I scored a goal!”) from where you are now, visualize how you’ll get there (eg, a trick and then a shot), then do that.
adi_pradhan|2 years ago
It'll be much harder for more open-ended world problems, where the physics encountered may be rare enough in the dataset that the simulation breaks unexpectedly. For example, a glass smashing into the floor. The model doesn't simulate that causally afaik.
therein|2 years ago
There was that article a few months ago about how basically that's what the cerebellum does.
mdorazio|2 years ago
neom|2 years ago
(and throw this in for good measure https://www.wired.com/story/this-lab-grown-skin-could-revolu... heh)
liamYC|2 years ago
metabagel|2 years ago
pyinstallwoes|2 years ago
deadbabe|2 years ago
staring at a painting in a Museum
Then immediately jumping into an entire VR world based off the painting generated by an AI rendering it out on the fly
blueprint|2 years ago
aurareturn|2 years ago
SushiHippie|2 years ago
For example, the surfer is surfing in the air at the end:
https://cdn.openai.com/tmp/s/prompting_7.mp4
Or this "breaking" glass that does not break, but spills liquid in some weird way:
https://cdn.openai.com/tmp/s/discussion_0.mp4
Or the way this person walks:
https://cdn.openai.com/tmp/s/a-woman-wearing-a-green-dress-a...
Or wherever this map is coming from:
https://cdn.openai.com/tmp/s/a-woman-wearing-purple-overalls...
chkaloon|2 years ago
danans|2 years ago
> https://cdn.openai.com/tmp/s/a-woman-wearing-purple-overalls...
Notice also that at roughly 6 seconds there is a third hand putting the map away.
mr_toad|2 years ago
Maybe it’s been watching snowboarding videos and doesn’t quite understand the difference.
SiempreViernes|2 years ago
> https://cdn.openai.com/tmp/s/a-woman-wearing-a-green-dress-a...
Also, why does she have an umbrella sticking out from her lower back?
sega_sai|2 years ago
hackerlight|2 years ago
coffeebeqn|2 years ago
modeless|2 years ago
So this is why they haven't shown Will Smith eating spaghetti.
> These capabilities suggest that continued scaling of video models is a promising path towards the development of highly-capable simulators of the physical and digital world
This is exciting for robotics. But an even closer application would be filling holes in gaussian splatting scenes. If you want to make a 3D walkthrough of a space you need to take hundreds to thousands of photos with seamless coverage of every possible angle, and you're still guaranteed to miss some. Seems like a model this capable could easily produce plausible reconstructions of hidden corners or close up detail or other things that would just be holes or blurry parts in a standard reconstruction. You might only need five or ten regular photos of a place to get a completely seamless and realistic 3D scene that you could explore from any angle. You could also do things like subtract people or other unwanted objects from the scene. Such an extrapolated reconstruction might not be completely faithful to reality in every detail, but I think this could enable lots of applications regardless.
SiempreViernes|2 years ago
nopinsight|2 years ago
Why superhuman? Larger context length than our working memory is an obvious one, but there will likely be other advantages such as using alternative sensory modalities and more granular simulation of details unfamiliar to most humans.
Nathanba|2 years ago
roenxi|2 years ago
guybedo|2 years ago
The results really are impressive. Being able to generate such high quality videos, and to extend videos into the past and the future, shows how much the model "understands" the real world, object interactions, 3D composition, etc...
Although image generation already requires the model to know a lot about the world, I think there's really a huge gap with video generation, where the model needs to "know" 3D, object movements and interactions.
iliane5|2 years ago
I can't wait to play with this but I can't even imagine how expensive it must be. They're training in full resolution and can generate up to a minute of video.
Seeing how bad video generation was, I expected it would take a few more years to get to this but it seems like this is another case of "Add data & compute"(TM) where transformers prove once again they'll learn everything and be great at it
data-ottawa|2 years ago
The robot examples are very underwhelming, but the people and background people are all very well done, and at a level much better than most static image diffusion models produce. Generating the same people as they interact with objects is also not something I expected a model like this to do well so soon.
lairv|2 years ago
Nihilartikel|2 years ago
The crazy thing is that they do it by prompting the model to inpaint a chrome sphere into the middle of the image to reflect what is behind the camera! The model can interpret the context and dream up what is plausibly in the whole environment.
larschdk|2 years ago
It is impressive at a glance, but if you pay attention, it is more like dreaming than realism (images conjured out of other images, without attention to long term temporal, spatial, and causal consistency). I'm hardly more impressed than I was by Google's DeepDream, which is 10 years old.
crooked-v|2 years ago
nodja|2 years ago
[1] https://dreamfusion3d.github.io/
TOMDM|2 years ago
pedrovhb|2 years ago
anonyfox|2 years ago
I know it's a delicate topic people (especially in the US) don't want to speak about at all, but damn, this is a giant market and could do humanity good if done well.
kenning|2 years ago
michalf6|2 years ago
Producing a neverending supply of wireheading-like addictive stimuli is the farthest possible thing from a good for humanity.
Want to do good in this area - work on ways to limit consumption.
zone411|2 years ago
https://twitter.com/LechMazur/status/1607929403421462528
https://twitter.com/LechMazur/status/1619032477951213568
dang|2 years ago
Sora: Creating video from text - https://news.ycombinator.com/item?id=39386156 - Feb 2024 (1430 comments)
GaggiX|2 years ago
kevingadd|2 years ago
I could imagine a text adventure game with a 'visual' component perhaps working if you got the model to maintain enough consistency in spaces and character appearances.
Jordan-117|2 years ago
https://www.youtube.com/watch?v=udPY5rQVoW0
koonsolo|2 years ago
Plus, it could generate it in real time and take my responses into account. I look bored? Spice it up, etc.
Today such a thing seems closer than I thought.
binary132|2 years ago
lanternfish|2 years ago
Presumably, this would work for non-Minecraft applications, but Minecraft has a really standardized interface layer.
chankstein38|2 years ago
proc0|2 years ago
myth_drannon|2 years ago
drcode|2 years ago
So this tech will increase interest in existing 3d tools in the short term
hackerlight|2 years ago
nojs|2 years ago
colesantiago|2 years ago
Edit, changed the links to the direct ones!
https://cdn.openai.com/tmp/s/simulation_6.mp4
https://cdn.openai.com/tmp/s/simulation_7.mp4
cptroot|2 years ago
These videos honestly give me less confidence in the approach, simply because I don't know that the model will be able to distinguish these world model parameters as "unchanging".
SushiHippie|2 years ago
example video links from TFA:
https://cdn.openai.com/tmp/s/simulation_6.mp4
https://cdn.openai.com/tmp/s/simulation_7.mp4
pmontra|2 years ago
htrp|2 years ago
Every cv preprocessing pipeline is in shambles now.
vunderba|2 years ago
sjwhevvvvvsj|2 years ago
Still, god damn.
tomxor|2 years ago
Everything also has a sort of mushy feel about it, like nothing is anchored down, everything is swimming. Impressive all the same, but maybe these methods need to be paired with some old fashioned 3D rendering to serve as a physically based guideline.
danavar|2 years ago
Reasoning, logic, formal systems, and physics exist in a seemingly completely different, mathematical space than pure video.
This is just a contrived, interesting viewpoint of the technology, right?
mr_toad|2 years ago
Vecr|2 years ago
That's not true, AI systems in general have pretty strong mathematical proofs going back decades on what they can theoretically do, the problem is compute and general feasibility. AIXItl in theory would be able to learn reasoning, logic, formal systems, physics, human emotions, and a great deal of everything else just from watching videos. They would have to be videos of varied and useful things, but even if they were not, you'd at least get basic reasoning, logic, and physics.
newswasboring|2 years ago
> Other interactions, like eating food, do not always yield correct changes in object state
Can this be because we just don't shoot a lot of people eating? I think it is general advice to not show people eating on camera for various reasons. I wonder if we know if that kind of topic bias exists in the dataset.
anirudhv27|2 years ago
pellucide|2 years ago
Is this generating videos as streaming content, e.g. like an MP4 video? As far as I can see, it is doing that. Is it possible for AI to actually produce the 3D models?
What kind of compute resources are required to produce the 3D models?
andybak|2 years ago
https://twitter.com/BenMildenhall/status/1758224827788468722
https://twitter.com/ScottieFoxTTV/status/1758272455603327455
The key is that the video has spatial consistency. Once you've got that, then other existing tech can take the output and infer actual spatial forms.
jk_tech|2 years ago
1. High quality video or image from text
2. Taking in any content as input and generating forwards/backwards in time
3. Style transformation
4. Digital World simulation!
exe34|2 years ago
neurostimulant|2 years ago
lbrito|2 years ago
shiroiushi|2 years ago
mr_toad|2 years ago
blueprint|2 years ago
so they're gonna include the never-before-observed-but-predicted Unruh effect, as well? and other quantum theory? cool..
> For example, it does not accurately model the physics of many basic interactions, like glass shattering.
... oh
Isn't all of the training predicated on visible, gathered data - rather than theory? If so, I don't think it's right to call these things simulators of the physical world if they don't include physical theory. DFT at least has some roots in theory.
FeepingCreature|2 years ago
liuliu|2 years ago
yakito|2 years ago
93po|2 years ago
tokai|2 years ago
SuaveSteve|2 years ago
advael|2 years ago
One reason the less optimistic scenarios seem likely is that the kinds of extrapolation errors this model makes are of similar character to those of LLMs: extrapolation follows a gradient of smooth apparent transitions rather than some underlying logic about the objects portrayed, and sometimes seems to just sort of ignore situations that are far enough outside of what it's seen rather than reconcile them. For example, the tidal wave/historical hall example is a scenario unlikely to have been in the training data. Sure, there's the funny bit at the end where the surfer appears to levitate in the air, but there's a much larger issue with how these two contrasting scenes interact, or rather fail to. What we see looks a lot more like a scene of surfing superimposed via Photoshop or something on a still image of the hall, as there's no evidence of the water interacting with the seats or walls in the hall at all.
The model will just roll with whatever you tell it to do as best it can, but it's not doing something like modeling "what would happen if" that implausible scenario played out, and even doing it poorly would be a better sign that it is doing something like "simulating" the described scenario. Instead, we have impressive results for prompts that likely strongly correspond to scenes the model may have seen, and evidence of a lack of composition in cases where a particular composition is unlikely to have been seen and needs some underlying understanding of how it "would" work that is visible to us.
bawana|2 years ago
stephenitis|2 years ago
Also, the concept of learning to simulate the world seems more important than just the media and content implications.
RayVR|2 years ago
danielbln|2 years ago
andybak|2 years ago
Also - ironic choice of username considering this comment!