_Nat_ | 2 years ago
It doesn't think like that.
If it did, they could've just done `P(hasFiveFingersPerHand)=0.99999`.
But it doesn't even necessarily draw what you ask it to. Instead, it applies a series of de-noising transforms that it's been trained to expect will lead toward what the prompt describes. Whatever those transforms produce is then, hopefully, roughly what was requested.
Der_Einzige | 2 years ago
https://colab.research.google.com/drive/1dlgggNa5Mz8sEAGU0wF...
You can see them define a custom color loss and apply it simultaneously with the regular diffusion loss. I've actually expanded this notebook to allow regional specification of the custom loss.
It's quite difficult to define a function that detects if an individual has 5 fingers or not. That's the real issue.
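To make the idea concrete, here is a toy sketch of steering each de-noising step with the gradient of a custom color loss. It is not the notebook's code: `toy_denoise_step` is a hypothetical stand-in for a real model (it just pulls pixels toward mid-gray), the "image" is a short list of RGB pixels, and `guidance_scale=4.0` is an assumed value picked so the update doesn't overshoot in this toy. A color loss is used because it's easy to write down, unlike a finger counter.

```python
import random

def mean_color(pixels):
    """Average RGB value over all pixels."""
    n = len(pixels)
    return [sum(p[c] for p in pixels) / n for c in range(3)]

def color_loss_grad(pixels, target):
    """Per-pixel gradient of loss = sum_c (mean_c - target_c)^2.

    Every pixel contributes equally to the mean, so the gradient is the
    same for each pixel: 2 * (mean_c - target_c) / n.
    """
    n = len(pixels)
    mean = mean_color(pixels)
    return [2.0 * (mean[c] - target[c]) / n for c in range(3)]

def toy_denoise_step(pixels, strength=0.1):
    """Hypothetical stand-in for a model's de-noising step (pulls toward gray)."""
    return [[v + strength * (0.5 - v) for v in p] for p in pixels]

def guided_sample(target, guidance_scale=4.0, steps=50, n_pixels=16, seed=0):
    """Start from noise; alternate a de-noising step with a custom-loss step."""
    rng = random.Random(seed)
    pixels = [[rng.random() for _ in range(3)] for _ in range(n_pixels)]
    for _ in range(steps):
        pixels = toy_denoise_step(pixels)
        grad = color_loss_grad(pixels, target)
        # Step against the custom-loss gradient, then clamp to valid range.
        pixels = [
            [min(1.0, max(0.0, p[c] - guidance_scale * grad[c])) for c in range(3)]
            for p in pixels
        ]
    return pixels

reddish = (0.9, 0.1, 0.1)
guided = mean_color(guided_sample(reddish))
unguided = mean_color(guided_sample(reddish, guidance_scale=0.0))
print("guided mean color:  ", guided)
print("unguided mean color:", unguided)
```

With guidance on, the mean color ends up much closer to the reddish target; with `guidance_scale=0.0` the toy de-noiser alone settles near gray. Swapping in a finger-count loss would need a differentiable detector, which is the hard part.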
_Nat_ | 2 years ago
My point was that it doesn't actually think like that. For example, prompting Stable Diffusion for a picture of a doctor doesn't necessarily get it to draw a human at all, much less a doctor of a predetermined sex. Instead, Stable Diffusion de-noises the image until a result emerges, and that result would (ideally) contain a doctor of whatever sex it happened to come up with.
That said, you're right that we can add more code to try to guide things.
We could even brute-force it by re-generating images over and over, or tweaking them after generation, until they match what we wanted. (Realistically, something like branch-and-bound would probably be preferred to blind guess-and-check.)
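A minimal sketch of the blind guess-and-check version, just to show the shape of the loop. Everything here is hypothetical: `generate` stands in for an image generator and simply returns a number of fingers per hand, and `has_five_fingers` stands in for the detector, which in practice is the hard part to build.

```python
import random

def generate(rng):
    """Hypothetical generator: returns the finger count it happened to draw."""
    return rng.choice([4, 5, 5, 6, 7])

def has_five_fingers(sample):
    """Hypothetical checker; a real one would have to inspect an image."""
    return sample == 5

def generate_until(check, rng, max_attempts=100):
    """Re-generate until `check` passes, giving up after `max_attempts`."""
    for attempt in range(1, max_attempts + 1):
        sample = generate(rng)
        if check(sample):
            return sample, attempt
    raise RuntimeError(f"no acceptable sample in {max_attempts} attempts")

rng = random.Random(0)
sample, attempts = generate_until(has_five_fingers, rng)
print(f"accepted a sample with {sample} fingers after {attempts} attempt(s)")
```

The cost is one full generation per attempt, which is why something smarter than blind retrying would be preferable for a real model.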