The title isn't wrong, but it doesn't feel quite right either. What they do here is use a pre-trained model to guide the training of a second model. Of course that massively speeds up training of the second model, but it's not as though you can now train a diffusion model from scratch 20x faster. Instead, this is a technique for transplanting an existing model onto a different architecture so that you don't have to start training from zero.
It does feel right to me, because it's not distilling the second model, and in fact the second model is not an image generation model at all, but a visual encoder. That is, it's a more "general purpose" model that specializes in extracting semantic information from images.
In hindsight it makes total sense: generative image models don't automatically start out with a notion of semantic meaning or of the world, so they have to learn one implicitly during training. That's a hard task by itself, and the model isn't specifically trained for it; it picks it up on the fly at the same time as it learns to create images. The idea of the paper, then, is to give the diffusion model a preexisting concept of the world by nudging its internal representations to be similar to the visual encoder's. As I understand it, DINO isn't even used during inference once the model is ready; it's purely about the representations.
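To make "nudging its internal representations" concrete: roughly (and with all names hypothetical, this is a sketch of the general idea rather than the paper's exact implementation), the extra training term could be a negative cosine similarity between projected diffusion features and the frozen encoder's features, added to the usual diffusion loss:

```python
import numpy as np

def alignment_loss(diffusion_hidden, encoder_feats, proj_matrix):
    """Negative mean cosine similarity between projected diffusion-model
    hidden states and frozen visual-encoder features (one row per patch
    token). Minimizing this pulls the two representations together."""
    # Map diffusion features into the encoder's feature space.
    projected = diffusion_hidden @ proj_matrix
    # Normalize both sides so only the direction of each vector matters.
    a = projected / np.linalg.norm(projected, axis=-1, keepdims=True)
    b = encoder_feats / np.linalg.norm(encoder_feats, axis=-1, keepdims=True)
    # Cosine similarity per token, averaged; negated so lower is better.
    return -np.mean(np.sum(a * b, axis=-1))
```

In training you'd optimize something like `total = diffusion_loss + lam * alignment_loss(...)`, with the encoder kept frozen; only the projection and the diffusion model get gradients.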
I wouldn't at all describe it as "a technique for transplanting an existing model onto a different architecture". It's different from distillation because again, DINO isn't an image generation model at all. It's more like (very roughly simplifying for the sake of analogy) instead of teaching someone to cook from scratch, we're starting with a chef who already knows all about ingredients, flavors, and cooking techniques, but hasn't yet learned to create dishes. This chef would likely learn to create new recipes much faster and more effectively than someone starting from zero knowledge about food. It's different from telling them to just copy another chef's recipes.
Yes, it seems obvious now, but before this it wasn't clear that this could speed things up, since the pretrained model was trained on a separate objective. It's a brilliant idea that works amazingly well.
Diffusion models are already being evaluated using pretrained SSL models à la CLIP Score [1]. So it makes sense that one would incorporate that directly into training the model from scratch.
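For context, CLIP Score itself is simple: it's just the (scaled, floored-at-zero) cosine similarity between the CLIP image embedding and the CLIP text embedding. A minimal sketch, assuming you already have the two embedding vectors from a CLIP model:

```python
import numpy as np

def clip_score(image_emb, text_emb):
    """CLIP Score for one image/caption pair: 100 * cosine similarity
    between the embeddings, clipped below at zero."""
    i = image_emb / np.linalg.norm(image_emb)
    t = text_emb / np.linalg.norm(text_emb)
    return 100.0 * max(float(i @ t), 0.0)
```

The difference with this paper is that the pretrained encoder's features feed into the training loss itself rather than only into a post-hoc evaluation metric.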
I wonder how well this technique works when the training data distributions of the diffusion model and the image encoder are quite different, for example using DINOv2 as the encoder but training the diffusion model on anime.
So I can't find that paper that was posted on HN which argued that, when viewed under the right theoretical framework, diffusion models and transformers are doing the same thing under a different basis. Am I misremembering something?
fxtentacle|1 year ago
pedrovhb|1 year ago
byyoung3|1 year ago
zaptrem|1 year ago
viktour19|1 year ago
[1] https://huggingface.co/docs/diffusers/en/conceptual/evaluati...
GaggiX|1 year ago
gdiamos|1 year ago
nextaccountic|1 year ago
kleiba|1 year ago