I haven't seen any details on how OpenAI's model works, but the tokens it generates aren't directly translated into pixels. Those tokens are probably fed into a diffusion process that generates the actual image. In that case the tokens serve as the latent space or conditioning for the actual image generation process.
bonoboTP|10 months ago
Exactly. People just confidently make things up. There are many possible ways to do it, and without details, "native generation" is just a marketing buzzword without a clear definition. It's a proprietary system: there is no code release and no publication. We simply don't know exactly how it's done.
famouswaffles|10 months ago
It's probably an implementation of VAR (https://arxiv.org/abs/2404.02905), which is autoregressive image generation with a small twist. Rather than predicting every token at the target resolution directly, it starts by predicting at a small resolution and then steps up through higher and higher resolutions until it reaches the desired one.
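
To make the coarse-to-fine idea concrete, here is a toy sketch of the loop in Python. It is not the VAR code and says nothing about what OpenAI actually does. `predict_scale`, the scale schedule `(1, 2, 4, 8)`, and the vocabulary size are all made up for illustration, and the "model" is a random stub standing in for a transformer that would condition on the coarser token maps.

```python
import random

def upsample_nearest(grid, size):
    """Nearest-neighbour upsample of a square grid (list of lists) to size x size."""
    src = len(grid)
    return [[grid[r * src // size][c * src // size] for c in range(size)]
            for r in range(size)]

def predict_scale(context, size, vocab_size, rng):
    """Hypothetical stand-in for the model: returns a size x size token map.

    A real model would predict all tokens of this scale conditioned on the
    coarser maps; this stub only perturbs the upsampled previous map.
    """
    if context is None:
        # Coarsest scale: nothing to condition on yet.
        return [[rng.randrange(vocab_size) for _ in range(size)] for _ in range(size)]
    up = upsample_nearest(context, size)
    return [[(tok + rng.randrange(2)) % vocab_size for tok in row] for row in up]

def generate(scales=(1, 2, 4, 8), vocab_size=16, seed=0):
    """Run one autoregressive step per scale, from coarse to fine."""
    rng = random.Random(seed)
    maps, context = [], None
    for size in scales:
        context = predict_scale(context, size, vocab_size, rng)
        maps.append(context)
    return maps

token_maps = generate()
for m in token_maps:
    print(f"{len(m)}x{len(m[0])} token map")
```

The point of the sketch is the step count: the loop runs once per scale (4 steps here) instead of once per token (64 for an 8x8 grid in plain raster-order generation).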