(no title)
mota7 | 3 years ago
You could also think of this as: We start with a terrible signal-to-noise ratio. So we need to average over very large areas to get any reasonable signal. But as we increase the signal, we can average over a smaller area to get the same signal-to-noise ratio.
In the beginning, we're averaging over large areas, so all the fine detail is lost. We just get 'might be a dog? maybe??'. What the network is doing is saying "if this is a dog, there should be a head somewhere over here. So let me make it more like a head". Which improves the signal-to-noise ratio a bit.
After a few more steps, the signal is strong enough that we can get sufficient signal from smaller areas, so it starts saying 'head of a dog' in places. So the network will then start doing "Well, if this is a dog's head, there should be some eyes. Maybe two, but probably not three. And they'll be kinda somewhere around here".
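A rough sketch of that trade-off, under a toy assumption that isn't in the description above: each pixel is a constant signal plus independent Gaussian noise, so averaging N pixels shrinks the noise by sqrt(N). The function names are made up for illustration, and this is not what a diffusion network literally computes, just the arithmetic behind "stronger signal, smaller area".

```python
import math
import random
import statistics


def snr_after_averaging(signal, noise_sigma, area):
    """SNR of the mean of `area` pixels, each = signal + independent noise.

    Toy assumption: noise is i.i.d. Gaussian, so the noise on the mean
    is noise_sigma / sqrt(area).
    """
    return signal / (noise_sigma / math.sqrt(area))


def min_area_for_snr(signal, noise_sigma, target_snr):
    """Smallest number of pixels we must average to reach target_snr."""
    return math.ceil((target_snr * noise_sigma / signal) ** 2)


def empirical_noise_of_mean(noise_sigma, area, trials=2000, seed=0):
    """Monte Carlo check: spread of the mean of `area` pure-noise pixels."""
    rng = random.Random(seed)
    means = [
        statistics.fmean(rng.gauss(0.0, noise_sigma) for _ in range(area))
        for _ in range(trials)
    ]
    return statistics.pstdev(means)


if __name__ == "__main__":
    # As the signal gets stronger, the area needed for the same SNR shrinks.
    for signal in (0.25, 0.5, 1.0, 2.0):
        area = min_area_for_snr(signal, noise_sigma=1.0, target_snr=4.0)
        print(f"signal={signal}: average over {area} pixels")
```

With the noise level fixed, doubling the signal cuts the required area by a factor of four, which is why the "useful" region can shrink from the whole image to a head to an eye as the steps go on.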
Why do it this way?
Doing it this way means the network doesn't need to learn "Here are all the ways dogs can look". Instead, it can learn a factored representation: A dog has a head and a body. The network only needs to learn a very fuzzy representation at this level. Then a head has some eyes and maybe a nose. Again, it only needs to learn a very fuzzy representation and (very) rough relative locations.
So it's only when it gets right down into fine detail that it actually needs to learn a pixel-perfect representation. But this is _way_ easier, because in small areas images have surprisingly low entropy.
The 'text-to-image' bit is just a twist on the basic idea. At the start when the network is going "dog? or it might be a horse?", we fiddle with the probabilities a bit so that the network starts out convinced there's a dog in there somewhere. At which point it starts making the most likely places look a little more like a dog.
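As a toy picture of "fiddling with the probabilities": take the network's initial guess over labels, boost the one named in the prompt, and renormalize. This is only an illustration of the intuition. The function name, labels, and boost factor are all made up, and real text-to-image models condition on the prompt in more involved ways than reweighting a two-entry table.

```python
def nudge_towards(probs, label, boost):
    """Multiply one label's probability by `boost`, then renormalize.

    `probs` maps labels to probabilities that sum to 1 (hypothetical
    stand-in for the network's early "dog? horse?" guess).
    """
    weighted = {k: p * (boost if k == label else 1.0) for k, p in probs.items()}
    total = sum(weighted.values())
    return {k: w / total for k, w in weighted.items()}


# The network is initially undecided, leaning slightly towards "horse".
initial = {"dog": 0.4, "horse": 0.6}

# The prompt says "dog", so we tilt the starting belief towards it.
nudged = nudge_towards(initial, "dog", boost=6.0)

if __name__ == "__main__":
    print(nudged)
```

After the nudge "dog" is the clear favourite, so every later "make it more like what I think it is" step pulls the image in that direction.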
sroussey|3 years ago