ollin | 6 months ago

This is very encouraging progress, and probably what Demis was teasing [1] last month. A few speculations on technical details based on staring at the released clips:

1. You can see fine textures "jump" every 4 frames - which means they're most likely using a 4x-temporal-downscaling VAE with at least 4-frame interaction latency (unless the VAE is also control-conditional). Unfortunately I didn't see any real-time footage to confirm the latency (at one point they intercut screen recordings with "fingers on keyboard" b-roll? hmm).

2. There's some 16x16 spatial blocking during fast motion which could mean 16x16 spatial downscaling in the VAE. Combined with 1, this would mean 24x1280x720/(4x16x16) = 21,600 tokens per second, or around 1.3 million tokens per minute.

3. The first frame of each clip looks a bit sharper and less videogamey than later stationary frames, which suggests this could be a combination of a text-to-image system and an image-to-world system (where the t2i system is trained on general data but the i2w system is finetuned on game data with labeled controls). Noticeable in e.g. the dirt/textures in [2]. I still noticed some trend towards more contrast/saturation over time, but it's not as bad as in other autoregressive video models I've seen.
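
For anyone who wants to check the arithmetic in points 1 and 2, here is a rough sketch. Every input (24 fps, 1280x720, 4x temporal and 16x16 spatial downscaling, one token per latent position) is a guess from the clips rather than a confirmed spec, and the function name is made up for illustration.

```python
# Back-of-envelope token budget for a video VAE.
# All inputs are speculation from the released clips, not confirmed specs.

def latent_tokens_per_second(fps, width, height, temporal_factor, spatial_factor):
    """Tokens per second of video, assuming one token per
    (temporal_factor x spatial_factor x spatial_factor) block of pixels."""
    pixels_per_second = fps * width * height
    pixels_per_token = temporal_factor * spatial_factor * spatial_factor
    return pixels_per_second / pixels_per_token

tokens_per_second = latent_tokens_per_second(24, 1280, 720, 4, 16)
tokens_per_minute = tokens_per_second * 60

# If the VAE decodes 4 frames at a time, a control input can only take
# effect at the next 4-frame boundary, so this is a lower bound on latency.
min_latency_ms = 4 / 24 * 1000

print(tokens_per_second)      # 21600.0
print(tokens_per_minute)      # 1296000.0
print(round(min_latency_ms))  # 167
```

So under these assumptions the 4-frame interaction latency would be at least about 167 ms at 24 fps, before any model or network time.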

[1] https://x.com/demishassabis/status/1940248521111961988

[2] https://deepmind.google/api/blob/website/media/genie_environ...

ollin|6 months ago

Regarding latency, I found a live video of gameplay here [1] and it looks closer to 1.1s keypress-to-photon latency (33 frames @ 30fps) based on when the onscreen keys start lighting up vs when the camera starts moving. This writeup [2] from someone who tried the Genie 3 research preview mentions that "while there is some control lag, I was told that this is due to the infrastructure used to serve the model rather than the model itself" so a lot of this latency may be added by their client/server streaming setup.
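
A quick sketch of how that measurement compares with the speculated 4-frame VAE floor. The 33-frame count is my own reading of the video, and the 4-frame chunk at 24 fps is the earlier guess, so treat the result as a rough estimate only.

```python
# Compare the measured keypress-to-photon latency with the speculated VAE floor.
# Both inputs are estimates, not confirmed numbers.

def frames_to_seconds(frames, fps):
    """Convert a frame count to seconds at a given frame rate."""
    return frames / fps

measured_s = frames_to_seconds(33, 30)   # counted from the recording
vae_floor_s = frames_to_seconds(4, 24)   # hypothetical 4-frame latent chunk
other_s = measured_s - vae_floor_s       # left for model, serving, streaming

print(round(measured_s, 2))   # 1.1
print(round(vae_floor_s, 2))  # 0.17
print(round(other_s, 2))      # 0.93
```

If those guesses hold, roughly 0.9s of the lag would come from something other than the temporal chunking, which fits the claim that serving infrastructure is responsible for much of it.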

[1] https://x.com/holynski_/status/1952756737800651144

[2] https://togelius.blogspot.com/2025/08/genie-3-and-future-of-...

rotexo|6 months ago

You know that thing in anxiety dreams where you feel very uncoordinated and your attempts to manipulate your surroundings result in unpredictable consequences? Like you try to slam on the brake pedal but your car doesn’t slow down, or you’re trying to get a leash on your dog to lead it out of a dangerous situation and you keep failing to hook it on the collar? Maybe that’s extra latency because your brain is trying to render the environment at the same time as it is acting.

blibble|6 months ago

> I found a live video of gameplay here [1] and it looks closer to 1.1s keypress-to-photon latency (33 frames @ 30fps) based on when the onscreen keys start lighting up vs when the camera starts moving.

so better than Stadia?