

If the prompt is the compass and represents a point in space, why walk there? Why not just go to that point in image space directly? What would be there? And when does the random seed matter: if you're aiming at the same point anyway, don't you end up there regardless? Does the prompt vector not exist in the image manifold, or is there some local sampling done to pick images which are better represented in the training data?


whilefalse|1 day ago

I’m not an expert, and the post was just based on my understanding, but as I see it: the prompt embedding space and the latent image space are different “spaces”, so there is no single “point” in the latent image space that represents a given prompt. There are regions that are more or less consistent with the prompt, and thanks to cross-attention between the text embedding and the latent image, the model is able to guide the diffusion process in a suitable direction.

So different seeds lead to slightly different end points, because you’re just moving closer to the “consistent region” at each step, but approaching from a different angle.

ainch|1 day ago

One way of thinking about diffusion is that you're learning a velocity field that points from unlikely images toward likely ones in the latent space, and that field changes depending on your conditioning prompt. You start from a sample drawn from a known noise distribution, and then take small steps following the velocity field, eventually ending up at a stable endpoint (which corresponds to the final image). Because your starting point is a random sample from that noise distribution, if you pick a slightly different starting point (seed), you'll end up at a slightly different endpoint.

You can't jump to the endpoint because you don't know where it is: all you can compute is 'from where I am, which direction should my next step be.' This is also why naive few-step diffusion tends to give poor results. If you take big jumps over the velocity field you're only going in approximately the right direction, so you won't end up at a properly stable point which corresponds to a "likely" image.
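
To make that picture concrete, here is a minimal toy sketch, not how any real diffusion model is implemented. It assumes a 2D "latent space" in which the "likely images" for a prompt lie on a ring, and the prompt is stood in for by the ring's radius. The names `velocity`, `sample`, and `distance_to_ring`, and the `pull` strength, are all hypothetical and chosen only for illustration. The sampler starts from seeded Gaussian noise and takes plain Euler steps along the field.

```python
import math
import random


def velocity(x, y, radius, pull=4.0):
    """Toy conditional velocity field (an assumption, not a trained model):
    it points radially toward a ring of the given radius."""
    rho = math.hypot(x, y)
    if rho == 0.0:
        return 0.0, 0.0  # direction is undefined exactly at the origin
    scale = pull * (radius - rho) / rho
    return scale * x, scale * y


def sample(seed, radius, steps, total_time=1.0):
    """Start from seeded Gaussian noise and follow the field with Euler steps."""
    rng = random.Random(seed)
    x, y = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
    h = total_time / steps  # fewer steps means bigger jumps
    for _ in range(steps):
        vx, vy = velocity(x, y, radius)
        x, y = x + h * vx, y + h * vy
    return x, y


def distance_to_ring(point, radius):
    """How far a point is from the ring of 'likely' outputs."""
    return abs(math.hypot(point[0], point[1]) - radius)


if __name__ == "__main__":
    radius = 2.0  # stands in for the conditioning prompt
    for seed in (0, 1):
        for steps in (50, 2):
            end = sample(seed, radius, steps)
            print(seed, steps, end, distance_to_ring(end, radius))
```

In this toy, two different seeds with 50 steps should land at two different points that both sit close to the ring, which is the "same region, different endpoint" behaviour described above. With only 2 steps the same seed overshoots on each big jump and typically ends further from the ring. The sampler never knows the endpoint in advance; it only ever evaluates the local direction.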