idiliv | 2 years ago
Based on my experience doing research on Stable Diffusion, scaling up the resolution is the conceptually easy part: it mainly requires larger models and more high-resolution training data.
The hard part is semantic alignment with the prompt. Attempts to scale Stable Diffusion, such as SDXL, have yielded only marginally better prompt understanding (likely due to the continued reliance on CLIP prompt embeddings).
So, the key question here is how well Sora does prompt alignment.
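For context on why the prompt embeddings matter so much: in Stable Diffusion's U-Net, the image latents attend to the CLIP token embeddings via cross-attention, so the text encoder is the bottleneck for what the model can "understand." Here is a minimal toy sketch of that conditioning mechanism (random toy weights and shapes, not the actual SD code or dimensions):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(latent_tokens, text_embeds, d_head=8, seed=0):
    """Toy cross-attention: image latents (queries) attend to
    prompt-token embeddings (keys/values), as in SD's U-Net blocks.
    All weights are random placeholders for illustration only."""
    rng = np.random.default_rng(seed)
    d_lat = latent_tokens.shape[-1]
    d_txt = text_embeds.shape[-1]
    Wq = rng.standard_normal((d_lat, d_head))
    Wk = rng.standard_normal((d_txt, d_head))
    Wv = rng.standard_normal((d_txt, d_lat))
    Q = latent_tokens @ Wq
    K = text_embeds @ Wk
    V = text_embeds @ Wv
    attn = softmax(Q @ K.T / np.sqrt(d_head))
    # Residual update: latents are nudged by text-conditioned values.
    return latent_tokens + attn @ V

latent = np.zeros((16, 4))   # 16 latent "pixels", 4 channels (toy sizes)
prompt = np.ones((77, 8))    # 77 CLIP-style token embeddings, toy dim 8
out = cross_attention(latent, prompt)
print(out.shape)  # (16, 4)
```

However rich the prompt, the U-Net only ever sees it through these embeddings, which is why a weak text encoder caps prompt alignment regardless of image-side scaling.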
nimbleal | 2 years ago
I think one part of the problem is using English (or whatever natural language) for the prompts/training: there's too much inherent ambiguity. I'm interested to see what tools (like ControlNet with SD) are developed to overcome this.
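For readers unfamiliar with it, ControlNet sidesteps language ambiguity by feeding the model an unambiguous spatial signal (an edge map, pose skeleton, depth map, etc.) alongside the prompt. The core trick is that the control branch's output is added to the base model's features through a zero-initialized projection, so training starts from the unmodified base model. A rough toy sketch of that idea (illustrative names and shapes, not the real implementation):

```python
import numpy as np

def controlnet_step(unet_features, control_map, zero_init=True, seed=0):
    """Toy ControlNet idea: project a spatial control signal
    (e.g. an edge map) and add it to the U-Net features.
    With zero_init=True the added term starts at exactly zero,
    so the base model's behavior is initially unchanged."""
    rng = np.random.default_rng(seed)
    d = unet_features.shape[-1]
    if zero_init:
        proj = np.zeros((control_map.shape[-1], d))
    else:
        proj = rng.standard_normal((control_map.shape[-1], d))
    return unet_features + control_map @ proj

feats = np.ones((16, 4))  # toy U-Net features: 16 positions, 4 channels
ctrl = np.random.default_rng(1).standard_normal((16, 2))  # toy control map
out = controlnet_step(feats, ctrl)  # zero-init: output equals feats
```

Because the conditioning is geometric rather than linguistic, there is nothing for the model to misinterpret about *where* things go, only *what* to render there.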