top | item 41376371


mo_42 | 1 year ago

An implementation of the game engine in the model itself is theoretically the most accurate solution for predicting the next frame.

I'm wondering when people will apply this to other areas, like the real world. Would it learn the game engine of the universe (i.e. physics)?


radarsat1 | 1 year ago

There has definitely been research on simulating physics from observation, especially in fluid dynamics but also for rigid-body motion and collision. It's actually important for robotics applications. You can bet people will be applying this technique in those contexts.

I think for real-world applications one challenge is going to be the "action" signal, which is a necessary component of the conditioning signal that makes the simulation reactive. In video games you can just record the button presses, but for real-world scenarios you need difficult and intrusive sensor setups to record force signals.

(Again, for robotics maybe it's enough to record the motor commands; it's just that you can't easily record the "motor commands" of humans, for example.)
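In the game setting the action stream really is trivial to capture. A minimal sketch of what that logging looks like (all names and the button encoding here are illustrative, not from any specific system): pair each frame with the action taken on it, so a predictor can be trained on frame t plus action t to produce frame t+1.

```python
from dataclasses import dataclass

@dataclass
class Step:
    frame: bytes   # raw pixels for this timestep
    action: int    # bitmask of buttons held (hypothetical encoding)

def record_episode(frames, actions):
    """Pair each frame with the action taken on it; a conditioned
    predictor is then trained to map (frame t, action t) -> frame t+1."""
    return [Step(f, a) for f, a in zip(frames, actions)]

# Two fake frames, two fake button bitmasks.
episode = record_episode([b"f0", b"f1"], [0b01, 0b10])
```

For a robot, `action` would be the logged motor command instead; it's the human case where no such clean channel exists.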

cubefox | 1 year ago

A popular theory in neuroscience is that this is what the brain does:

https://slatestarcodex.com/2017/09/05/book-review-surfing-un...

It's called predictive coding. By trying to predict sensory stimuli, the brain creates a simplified model of the world, including common sense physics. Yann LeCun says that this is a major key to AGI. Another one is effective planning.
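As a toy illustration of the predictive-coding idea (a linear one-layer sketch, not the hierarchical models the literature actually studies): a latent estimate is iteratively nudged to reduce the error between the predicted and actual sensory input.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 3))          # generative weights: latent -> prediction
x = W @ np.array([1.0, -2.0, 0.5])   # sensory input generated by a true latent

z = np.zeros(3)                      # latent estimate, starts uninformative
for _ in range(2000):
    err = x - W @ z                  # prediction error
    z += 0.05 * W.T @ err            # move the latent to reduce the error

print(np.linalg.norm(x - W @ z))     # residual error is near zero
```

The brain-scale version is hierarchical and nonlinear, but the loop is the same shape: predict, compare, update the internal model.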

But while current predictive models (autoregressive LLMs) work well on text, they don't work well on video data, because of the large outcome space. In an LLM, text prediction boils down to a probability distribution over a few thousand possible next tokens, while there are several orders of magnitude more possible "next frames" in a video. Diffusion models work better on video data, but they are not inherently predictive like causal LLMs. Apparently this new Doom model made some progress on that front though.
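Rough numbers make the gap concrete (the sizes here are illustrative assumptions, not from the article): a ~50k-token vocabulary costs about 16 bits per next-token choice, while even a tiny 64x64 RGB frame at 256 levels per channel spans tens of kilobits per "next frame."

```python
import math

vocab = 50_000                    # typical LLM vocabulary size (assumed)
token_bits = math.log2(vocab)     # ~15.6 bits per next-token choice

h, w, c, levels = 64, 64, 3, 256  # a small raw video frame (assumed)
frame_bits = h * w * c * math.log2(levels)  # 98304 bits per next frame

print(token_bits, frame_bits)
```

That raw count overstates the *effective* outcome space (adjacent frames are highly correlated), but it shows why naive autoregressive frame prediction is so much harder than next-token prediction.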

ccozan | 1 year ago

However, this is due to how we actually digitize video. From a human point of view, looking around my room reduces the load to the _objects_ in the room, and everything else is just noise (e.g. the color of the wall could be a single item to remember, while in the digital world all of its pixels have to be remembered).
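A quick count makes the object-vs-pixel point concrete (the frame size and region-spec cost are made up for illustration): a uniformly colored wall filling half a 1920x1080 frame costs megabytes stored per pixel, versus a handful of bytes as "one color plus a region."

```python
# Hypothetical 1920x1080 frame where a uniform wall fills half the pixels.
wall_pixels = (1920 * 1080) // 2
raw_bytes = wall_pixels * 3   # 3 bytes (RGB) per pixel -> 3,110,400 bytes
object_bytes = 3 + 16         # one RGB triple + a rough region description

print(raw_bytes, object_bytes)
```

Real codecs already exploit this redundancy, of course; the argument is that an object-level representation takes it much further.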