spi | 1 year ago
As jzbontar mentions below, the crucial point is that the random noise mask is the same. Diffusion models are trained to turn random noise into an image, and (with a deterministic sampler) they are deterministic at it: the same noise leads to the same image.
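To illustrate the determinism point: if each denoising step is a pure function of the current state, then iterating it from the same starting noise must always give the same output. This is just a toy stand-in (the "step" below is not a real denoiser), but the structure is the same:

```python
import random

# Toy sketch (not real diffusion): a deterministic "denoiser" step applied
# n_steps times maps a noise vector to an output. Because every step is a
# pure function of its input, the same starting noise always yields the
# same final result.
def sample(noise, n_steps=1000):
    x = list(noise)
    for _ in range(n_steps):
        x = [0.999 * v for v in x]  # stand-in for one deterministic model step
    return x

rng = random.Random(0)
z = [rng.gauss(0, 1) for _ in range(8)]

assert sample(z) == sample(z)  # identical noise -> identical image
```

Stochastic samplers break this property by injecting fresh noise at each step, which is why distillation pipelines like this one rely on a deterministic sampler for the teacher.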
What the authors did here was find a smart way of training a new model that "simulates" in a single step what diffusion achieves in many. To do so, they took many triplets of (prompt, noise, image) generated from random noise with a fixed, pretrained StableDiffusion checkpoint, and trained the new model to replicate those results.
So it is already surprising that this works at all at creating meaningful images, but it would be _really_ surprising (i.e. probably impossible) if it generated meaningful images seriously different from the ones produced by the teacher model it was distilled from!
albert_e | 1 year ago
Pardon my ignorance ...
Does the MIT model then not work as a general text-to-image model, generating novel images from arbitrary new text prompts it has not seen before?
spi | 1 year ago
My understanding is that this paper by MIT doesn't train any new model from scratch. It takes a pretrained model (e.g. StableDiffusion), which is trained to do only "a small step": you fix a number of steps (e.g. 1000 in the MIT paper) and ask the model to predict how to "enhance" an image by one step (e.g. of size 1/1000); the constants are adjusted so that, if the model were "perfect", you would get from pure white noise to an image in exactly the number of steps you set. If I remember correctly how diffusion works, in theory you could set this number to any value, including 1, but in practice you need several hundred steps to get a good result, i.e. the original StableDiffusion model can only fit a small adjustment per step.
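The "constants adjusted so N steps take you from noise to image" part can be sketched with a DDPM-style noise schedule. This is a hedged illustration with made-up numbers (the linear beta range below is a common textbook choice, not necessarily what the MIT paper or StableDiffusion uses): with 1000 steps essentially none of the original signal survives the forward process, while with 1 step almost all of it does, which is why the per-step sizes must be rescaled when you change the step count.

```python
import math

# Hedged sketch: a linear noise schedule beta_1..beta_N. After N forward
# noising steps, roughly prod(sqrt(1 - beta_i)) of the original image's
# signal remains; the schedule is chosen so this is ~0 at the final step.
# The beta range here is illustrative, not taken from the paper.
def linear_beta_schedule(n_steps, beta_start=1e-4, beta_end=0.02):
    if n_steps == 1:
        return [beta_start]
    return [beta_start + (beta_end - beta_start) * i / (n_steps - 1)
            for i in range(n_steps)]

def signal_remaining(betas):
    # Fraction of the original image surviving the full forward process.
    p = 1.0
    for b in betas:
        p *= math.sqrt(1.0 - b)
    return p

# With 1000 steps the image is driven (almost) to pure noise;
# with a single step of this size it is barely perturbed at all.
many = signal_remaining(linear_beta_schedule(1000))  # close to 0
one = signal_remaining(linear_beta_schedule(1))      # close to 1
```

So "setting the number to 1" without rescaling the schedule would leave the input nearly untouched; a one-step model has to learn the entire noise-to-image jump in a single prediction, which is exactly what the distillation targets.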
This new paper shows how to "distill" the original model (in this case, StableDiffusion) into another model. However, unlike typical distillation, which is used to compress a big model into a smaller one, here the distilled model is basically the same size as the one you start with; it has just been trained with a different objective, namely to transform random noise directly into the prediction that the original model (StableDiffusion) would make in 1000 steps. To do so, it is trained on a very large number of triplets (text, noise, image).

But I don't think you can incorporate other "real" images, not generated by the teacher model, into this training procedure, because you don't have a corresponding noise for them. Abstractly, there is no such concept as "the noise corresponding to a given image": the noise -> image relation depends on the specific model you start with, and this map is nowhere near invertible, since not every image can be generated by StableDiffusion (or any other model).
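The data pipeline described above can be sketched as follows. Everything here is a stand-in: `teacher_sample` is a toy deterministic function playing the role of the many-step StableDiffusion sampler, and prompts are just integers. The point is only the shape of the procedure: sample noise, run the teacher the full 1000 steps, and record the (prompt, noise, image) triplet; the student is then trained to map (prompt, noise) to that image in one forward pass.

```python
import random

# Hedged sketch of the distillation data pipeline. The "teacher" is a toy
# stand-in for a many-step deterministic diffusion sampler, not real
# StableDiffusion; all names here are illustrative.
def teacher_sample(prompt, noise, n_steps=1000):
    # Many-step deterministic teacher: (prompt, noise) -> image.
    x = list(noise)
    for _ in range(n_steps):
        x = [0.999 * v + 0.001 * prompt for v in x]
    return x

def make_triplet(prompt, rng, dim=8):
    noise = [rng.gauss(0, 1) for _ in range(dim)]
    image = teacher_sample(prompt, noise)
    return (prompt, noise, image)  # the (prompt, noise, image) triplet

rng = random.Random(0)
dataset = [make_triplet(p, rng) for p in range(4)]

# The student would then be trained so that student(prompt, noise) matches
# `image` in a single step, e.g. by minimizing a regression loss:
def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)
```

Note that the noise is sampled first and the image derived from it; there is no step going the other way, which is why arbitrary real images (with no known noise) can't be added to this dataset.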
Once the model is trained, you can of course give it a new prompt and, in theory, it should generate something rather similar to what StableDiffusion would generate for the same prompt (hopefully the examples displayed on their web page are not from the training set! Otherwise they would prove nothing). But you should never obtain something "totally different" from what StableDiffusion would give you, so in that sense it's not "general"; it is "just" a model that imitates StableDiffusion very well while being much faster. Which is already great, of course :-)