top | item 47209871

vunderba | 10 hours ago

Yeah I think that it's part of the issue with a single "squashed" comparative metric. Some users are going to grade higher based on the overall visual fidelity and others are going to value following the prompt.

For a point of reference, I run a pretty comprehensive image model comparison site heavily weighted in favor of prompt adherence.

https://genai-showdown.specr.net

EDIT: FWIW, I agree with your assessment. OpenAI's models have always been very strong in prompt adherence but visually weak (gpt-image-1 had the famous "piss filter" until they finally pushed out gpt-image-1.5).

vtail|9 hours ago

Very cool site - I think I saw it before here on HN, and I liked it a lot.

Did you review all the edit results manually yourself, or do you have some kind of automated procedure?

vunderba|9 hours ago

Thanks. So I have a bespoke Python program that basically does this:

- Takes the platonic set of prompts

- Uses model-specific tuning directives with LLMs to create a bunch of prompt variations, so that each model gets a diverse set of natural-language expressions to "roll" generations with

But I still have to manually review each of the final images - which is pretty time-consuming. I've tried automating it using VLMs (like Qwen3-VL), but unfortunately they can miss the small details and didn't provide as much value as I was hoping.
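If anyone's curious, the variation step looks roughly like this - heavily simplified, and `rephrase_with_llm` is just a stand-in here for the actual LLM API call, not the real code:

```python
import itertools

def rephrase_with_llm(prompt: str, directive: str) -> str:
    """Placeholder for the real LLM call that rewrites a prompt
    according to a model-specific tuning directive."""
    return f"[{directive}] {prompt}"

def build_variations(base_prompts, directives, rolls=3):
    """For each canonical prompt, produce several natural-language
    variations per model directive, so each model gets multiple
    'rolls' at the same underlying task."""
    variations = {}
    for prompt, directive in itertools.product(base_prompts, directives):
        variations.setdefault(prompt, []).extend(
            rephrase_with_llm(prompt, directive) for _ in range(rolls)
        )
    return variations

variations = build_variations(
    ["a red fox reading a newspaper"],
    ["terse", "verbose"],
    rolls=2,
)
```

The generated images then get filed per-prompt for the manual review pass.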