
wkcheng|1 year ago

It's insane that this works, and that it works fast enough to render at 20 fps. It seems like they almost made a cross between a diffusion model and an RNN, since they had to encode the previous frames and actions and feed them into the model at each step.
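
A toy sketch of that loop (the shapes, the tiny model, and the conditioning scheme here are all my assumptions, not the paper's):

```python
import torch
import torch.nn as nn

CTX, H, W, A = 4, 240, 320, 8  # context frames, resolution, action dims (assumed)

class TinyDenoiser(nn.Module):
    def __init__(self):
        super().__init__()
        # Past frames are stacked on the channel axis alongside the noisy frame.
        self.conv = nn.Conv2d(3 * (CTX + 1), 3, kernel_size=3, padding=1)
        # Actions enter as a learned per-channel bias (a crude stand-in for
        # whatever conditioning the real model uses).
        self.act_proj = nn.Linear(CTX * A, 3)

    def forward(self, noisy_next, past_frames, actions):
        x = torch.cat([noisy_next, past_frames.flatten(1, 2)], dim=1)
        bias = self.act_proj(actions.flatten(1)).view(-1, 3, 1, 1)
        return self.conv(x) + bias  # "denoised" next frame

model = TinyDenoiser()
past = torch.zeros(1, CTX, 3, H, W)      # rolling buffer of recent frames
for t in range(3):                        # RNN-like outer loop over time
    actions = torch.zeros(1, CTX, A)      # real-time player input goes here
    noisy = torch.randn(1, 3, H, W)       # start each frame from noise
    frame = model(noisy, past, actions)   # (one denoising step, for brevity)
    past = torch.cat([past[:, 1:], frame.unsqueeze(1)], dim=1)
```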

Abstractly, it's like the model is dreaming of a game it has played a lot of, and real-time inputs just change the state of the dream. It makes me wonder if humans are just next-moment prediction machines, with just a little bit more memory built in.

lokimedes|1 year ago

It makes good sense for humans to have this ability. If we flip the argument and see the next frame as a hypothesis for the expected outcome of the current frame, then comparing this "hypothesis" with what is actually sensed makes it easier to process just the differences, rather than the totality of the sensory input.
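
A toy illustration of that "process only the differences" idea (the predictor here is a deliberately naive placeholder, not a claim about how the brain does it):

```python
import numpy as np

def predict_next(frame: np.ndarray) -> np.ndarray:
    # Naive placeholder hypothesis: "the next moment looks like this one".
    return frame

prev_sensed = np.random.rand(240, 320)   # the previous sensory frame
expected = predict_next(prev_sensed)     # the "hypothesis" about what comes next
sensed = np.random.rand(240, 320)        # what actually arrives
error = sensed - expected                # only the surprise needs full processing
print("mean surprise:", np.abs(error).mean())
```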

As Richard Dawkins recently put it in a podcast[1], our genes are great prediction machines, as their continued survival rests on it. Being able to generate a visual prediction fits perfectly with the amount of resources we dedicate to sight.

If that is the case, what does aphantasia tell us?

[1] https://podcasts.apple.com/dk/podcast/into-the-impossible-wi...

dbspin|1 year ago

Worth noting that aphantasia doesn't necessarily extend to dreams. Anecdotally, I have pretty severe aphantasia (I can conjure millisecond glimpses of barely tangible imagery that I can't quite perceive before it's gone, and only since learning that visualisation wasn't a linguistic metaphor). I can't really simulate object rotation. I can't really 'picture' how things will look before they're drawn or built. However, I often have highly vivid dream imagery. I also have excellent recognition of faces and places (e.g. I can't get lost in a new city). So there is clearly a lot of preconscious visualisation and image matching going on in some aphantasia cases, even where the explicit visual screen is all but absent.

jonplackett|1 year ago

What’s the aphantasia link? I’ve got aphantasia. I’m convinced, though, that the bit of my brain that should be making images is used for letting me ‘see’ how things are connected together very easily in my head. Also, I still love games like Pictionary and can somehow draw things onto paper that I don’t really know what they look like in my head. It’s often a surprise when pen meets paper.

quickestpoint|1 year ago

"As Richard Dawkins theorized" would be more accurate and less LLM-like :)

nsbk|1 year ago

We are. At least that's what Lisa Feldman Barrett [1] thinks. It's worth listening to this Lex Fridman podcast, Counterintuitive Ideas About How the Brain Works [2], where she explains, among other ideas, how constant prediction is a more efficient way of running a brain than reaction. I never get tired of listening to her; she's such a great science communicator.

[1] https://en.wikipedia.org/wiki/Lisa_Feldman_Barrett

[2] https://www.youtube.com/watch?v=NbdRIVCBqNI&t=1443s

PunchTornado|1 year ago

Interesting talk about the brain, but what she says about free will is not a very good argument. It's basically the sort of argument the ancient Greeks made, which brings the discussion to a point where you can take it in either direction.

bangaladore|1 year ago

> It's insane that this works, and that it works fast enough to render at 20 fps.

It is running on an entire v5 TPU (https://cloud.google.com/blog/products/ai-machine-learning/i...)

It's unclear how that compares to a high-end consumer GPU like a 3090, but they seem to have similar INT8 TFLOPS. The TPU has less memory (16 GB vs. 24 GB), and I'm unsure of the other specs.

Something doesn't add up, though, in my opinion. SD usually takes at minimum a few seconds to produce a high-quality result on a 3090, so I can't comprehend how they are roughly two orders of magnitude faster; that would indicate the TPU vastly outperforms a GPU for this task. They seem to be producing low-res (320x240) images, but it still seems too fast.
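
Rough arithmetic on that budget (the SD latency and step count below are illustrative assumptions, not measurements):

```python
fps = 20
frame_budget_ms = 1000 / fps                  # 50 ms per generated frame
sd_latency_s = 5.0                            # assumed: ~50-step SD on a 3090
sd_steps = 50
per_step_ms = sd_latency_s * 1000 / sd_steps  # ~100 ms per denoising step
steps_that_fit = frame_budget_ms / per_step_ms
print(f"{frame_budget_ms:.0f} ms/frame budget, ~{per_step_ms:.0f} ms/step "
      f"-> only {steps_that_fit:.1f} steps fit")
# Closing the gap needs some combination of fewer denoising steps
# (distillation), lower resolution, and a faster accelerator.
```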

Philpax|1 year ago

There's been a lot of work on optimising SD inference speed: SD Turbo, latent consistency models, Hyper-SD, etc. It is very possible to hit these frame rates now.
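
For instance, a distilled model like SD Turbo can sample in a single step. A minimal sketch with Hugging Face diffusers (to the best of my knowledge of that API; actual latency depends on hardware):

```python
import torch
from diffusers import AutoPipelineForText2Image

# SD Turbo is distilled to work with 1-4 denoising steps instead of 20-50.
pipe = AutoPipelineForText2Image.from_pretrained(
    "stabilityai/sd-turbo", torch_dtype=torch.float16, variant="fp16"
).to("cuda")

# guidance_scale=0.0 because the distilled model doesn't use CFG.
image = pipe(
    "a corridor in a retro shooter level",
    num_inference_steps=1,
    guidance_scale=0.0,
).images[0]
image.save("frame.png")
```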

dartos|1 year ago

> It makes me wonder if humans are just next moment prediction machines, with just a little bit more memory built in.

This, to me, seems extremely reductionist, like you start with AI and work backwards until you've framed all cognition as next-something prediction.

It’s just the stochastic parrot argument again.

Teever|1 year ago

Also recursion and nested virtualization. We can dream about dreaming, and imagine different scenarios, some completely fictional and some simply possible futures, all while doing day-to-day stuff.

mensetmanusman|1 year ago

Penrose (Nobel Prize in Physics) speculates that quantum effects in the brain may allow a certain amount of time travel and backpropagation to accomplish this.

wrsh07|1 year ago

You don't need backpropagation to learn.

This is an incredibly complex hypothesis that doesn't really seem justified by the evidence.

richard___|1 year ago

Did they take in the entire history as context?

slashdave|1 year ago

An image is 2D; video is 3D. The mathematical extension is obvious. In this case it's low-resolution 2D (pixels), and the third dimension is just time at the frame rate (discrete steps). So, rather simple.
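
In shape terms, roughly (a trivial sketch):

```python
import torch

image = torch.zeros(3, 240, 320)      # (channels, height, width)
video = torch.zeros(16, 3, 240, 320)  # (frames, channels, height, width)
# The extra axis is just discrete time: at 20 fps, frame t sits at t / 20 s.
```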

Sharlin|1 year ago

This is not "just" video, however. It's interactive in real time. Sure, you can say that playing is simply video with some extra parameters thrown in to encode player input, but still.

InDubioProRubio|1 year ago

Video also carries more resolution than a single frame: as you move through the world, the way the pixels change encodes detail of the high-resolution world. Swivel your head without your glasses on, and even the blurry world contains more information in the curve of pixel change.