top | item 39112100


Birch-san | 2 years ago

> Is it super resolution?

nope, we don't do Imagen-style super-resolution. we go directly to high resolution with a single-stage model.


sorenjan|2 years ago

I was referring to the input image in the diagram. What is that, and how is the output image generated from it? Is it 256x256 noise that gets denoised into an image? I guess what I'm really asking is: what guides the process toward the final image if it's not text-to-image?

stefanbaumann|2 years ago

The "input image" is just the noisy sample from the previous timestep, yes.

The overall architecture diagram does not explicitly show the conditioning mechanism, which is a small separate network. For this paper, we only trained on class-conditional ImageNet and completely unconditional megapixel-scale FFHQ.
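To make the "noisy sample from the previous timestep" idea concrete, here is a rough sketch of a class-conditional sampling loop. Everything in it is an illustrative assumption rather than the paper's code: `toy_denoiser` and `CLASS_MEANS` are hypothetical stand-ins for the trained network and its conditioning, the "images" are 4 numbers long, and the update rule is a generic Euler-style step, not necessarily the sampler used in the paper.

```python
import random

# Hypothetical "clean images" per class (4 pixels each). Purely illustrative.
CLASS_MEANS = {0: [0.0, 0.0, 0.0, 0.0], 1: [1.0, -1.0, 1.0, -1.0]}


def toy_denoiser(x, sigma, class_label):
    """Stand-in for the trained network: returns a guess of the clean image.

    A real model would look at the noisy input x and the noise level sigma.
    This toy ignores both and returns the stored image for the class, which
    is enough to show how the class label steers the result.
    """
    return list(CLASS_MEANS[class_label])


def sample(class_label, sigmas, seed=0):
    """Start from pure noise and denoise step by step over a sigma schedule."""
    rng = random.Random(seed)
    # Initial "input image": Gaussian noise at the highest noise level.
    x = [rng.gauss(0.0, sigmas[0]) for _ in range(4)]
    for sigma, sigma_next in zip(sigmas, sigmas[1:]):
        denoised = toy_denoiser(x, sigma, class_label)
        # Direction pointing from the clean estimate to the noisy sample.
        d = [(xi - di) / sigma for xi, di in zip(x, denoised)]
        # Euler step to the next (lower) noise level; the result is the
        # "input image" for the following step.
        x = [xi + di * (sigma_next - sigma) for xi, di in zip(x, d)]
    return x


# Hypothetical decreasing noise schedule ending at zero noise.
SIGMAS = [10.0, 5.0, 2.0, 1.0, 0.5, 0.0]
image = sample(class_label=1, sigmas=SIGMAS)
```

In this toy, the only thing that decides which image comes out is the class label passed to the denoiser. The starting noise changes between seeds, but the loop pulls it toward the same class target each time.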

Training large-scale text-to-image models with this architecture is something we have not yet attempted, although there's no indication that this shouldn't work with a few tweaks.