Has there been research on using this to make models smaller? If models converge on similar representations, we should be able to build more efficient architectures around those core features.
It's more likely that such an architecture would be bigger rather than smaller. https://arxiv.org/abs/2412.20292 demonstrated that score-matching diffusion models approximate a process that combines patches from different training images. To build a model that makes use of this fact, all you need to do is look up the right patch in the training data. Of course a model the size of its training data would typically be rather unwieldy to use. If you want something smaller, we're back to approximations created by training the old-fashioned way.
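A minimal sketch of that lookup idea (all names and shapes here are illustrative, not from the paper): over a finite training set, the denoiser that exactly minimizes the score-matching objective has a closed form, a softmax-weighted average of the training points, where the weights are the posterior probabilities of each point given the noisy input.

```python
import numpy as np

def optimal_denoiser(x_noisy, train_data, sigma):
    """Closed-form loss-optimal denoiser when the data distribution is
    taken to be the empirical training set: a softmax-weighted average
    ("mosaic") of training points. Illustrative sketch, not the paper's code."""
    # Squared distance from the noisy input to every training point.
    d2 = np.sum((train_data - x_noisy) ** 2, axis=1)
    # Posterior weight of each training point given the noisy observation
    # (subtract the min for numerical stability before exponentiating).
    w = np.exp(-(d2 - d2.min()) / (2 * sigma ** 2))
    w /= w.sum()
    # The minimizer of E||x - x_hat||^2 is the posterior mean.
    return w @ train_data
```

With a small noise level this snaps to the nearest training example, i.e. pure memorization; the paper's observation is that trained models only approximate this optimum, which is exactly where the interpretation starts to bend.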
I have mixed feelings about this interpretation: that diffusion models approximately produce mosaics from patches of training data. It does a good job of helping people understand why diffusion models are able to work. I used it myself in a talk almost 3 years ago! And it isn't a lie exactly; the linked paper is totally sound. It's just that the interpretation only holds if you assume your model is an exactly optimal minimizer of the loss (under some inductive biases). It isn't. No machine learning method more complicated than OLS holds up to that standard.
_And that's the actual reason they work._ Underfit models don't just approximate; they interpolate, extrapolate, generalize a bit, and ideally smooth out the occasional total garbage mixed in with your data. In fact, diffusion models work so well because they can correct their own garbage! If extra fingers start to show up in step 5, then steps 6 and 7 still have a chance to reinterpret that as noise and correct back into distribution.
And then there's all the stuff you can do with diffusion models. In my research I hack into the model and use it to decompose images into surface material properties and lighting! That doesn't make much sense as averaging of memorized patches.
Given all that, it is a very useful interpretation. But I wouldn't take it too literally.
I've been thinking about this a lot. I want to know how small a model can be before letting it browse search engines, or files you host locally, stops being a viable way for an LLM to give you more informed answers. Is it 2GB? 8GB? Would love to know.
yorwba|7 months ago
samsartor|7 months ago
giancarlostoro|7 months ago