WhiteNoiz3 | 10 months ago

I haven't seen any details on how OpenAI's model works, but the tokens it generates aren't directly translated into pixels - those tokens are probably fed into a diffusion process which generates the actual image. The tokens are the latent space or conditioning for the actual image generation process.

bonoboTP|10 months ago

> I haven't seen any details on how OpenAI's model works

Exactly. People just confidently make things up. There are many possible ways to do it, and without details, "native generation" is just a marketing buzzword with no clear definition. It's a proprietary system: there is no code release and no publication. We simply don't know how exactly it's done.

famouswaffles|10 months ago

OpenAI has said both that it's native image generation and that it's autoregressive. It shows the signs of it, too.

It's probably an implementation of VAR (https://arxiv.org/abs/2404.02905) - autoregressive image generation with a small twist. Rather than predicting every token at the target resolution directly, it starts by predicting the token map at a small resolution, then predicts it again at higher and higher resolutions until it reaches the desired one.
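
To make the coarse-to-fine idea concrete, here's a toy sketch of next-scale generation. It is not OpenAI's system and not the VAR code: `toy_predictor`, `SCALES` and `VOCAB_SIZE` are made-up names, and the "model" is a random stand-in that only looks at the previous scale (the paper conditions on all coarser scales). It just shows the loop shape: each step emits a whole token map, conditioned on the coarser map before it.

```python
import random

SCALES = [1, 2, 4, 8]  # side lengths of the token maps, coarse to fine (assumed)
VOCAB_SIZE = 16        # hypothetical codebook size


def upsample_nearest(token_map, new_side):
    """Nearest-neighbour upsample of a square token map to new_side x new_side."""
    old_side = len(token_map)
    return [
        [token_map[r * old_side // new_side][c * old_side // new_side]
         for c in range(new_side)]
        for r in range(new_side)
    ]


def toy_predictor(context, side, rng):
    """Stand-in for the transformer: returns a whole side x side map in one step.

    With no context it samples random tokens; otherwise it perturbs the
    upsampled previous map so the coarse structure carries through.
    """
    if context is None:
        return [[rng.randrange(VOCAB_SIZE) for _ in range(side)]
                for _ in range(side)]
    base = upsample_nearest(context, side)
    return [[(tok + rng.choice([0, 0, 0, 1])) % VOCAB_SIZE for tok in row]
            for row in base]


def generate(seed=0):
    """Autoregress over scales (not over individual tokens)."""
    rng = random.Random(seed)
    maps, context = [], None
    for side in SCALES:
        context = toy_predictor(context, side, rng)
        maps.append(context)
    return maps


if __name__ == "__main__":
    for token_map in generate():
        print(len(token_map), "x", len(token_map[0]))
```

In a real system the final token map would still have to go through a decoder to become pixels; that part is left out here.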