Oh wow, I've been hearing about Nano Banana Pro in random stuff lately, but as a layman the difference is stark. It's the only one that actually looks like a partially eaten burrito at all to me. The others all look like staged marketing fake food, if I'm being generous (only a few actually approach that, most just look wrong).
This shows some gaps in the "same prompt to every model" approach to benchmarking models.
I get that it allows ensuring you're testing the model's capabilities rather than the prompt, but most models are post-trained with very different prompting formats.
I use Seedream in production so I was a little suspicious of the gap: I passed ByteDance's official prompting guide, the OP's prompt, and your feedback to Claude Opus 4.5 and got this prompt to create a new image:
> A partially eaten chicken burrito with a bite taken out, revealing the fillings inside: shredded cheese, sour cream, guacamole, shredded lettuce, salsa, and pinto beans all visible in the cross-section of the burrito. Flour tortilla with grill marks. Taken with a cheap Android phone camera under harsh cafeteria lighting. Compostable paper plate, plastic fork, messy table. Casual unedited snapshot, slightly overexposed, flat colors.
Then I generated with n=4 and the 'standard' prompt expansion setting for Seedream 4.0 Text To Image:
https://imgur.com/a/lxKyvlm
They're still not perfect (it's not adhering to the fillings being inside, for example), but it's massively better than OP's result.
Shows that a) random chance plays a big part, so you want more than one sample, and b) you don't have to "cheat" by spending massive amounts of time hand-iterating on a single prompt to get a better result.
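For anyone who wants to try the same thing, here's a rough sketch of what that kind of batched generation can look like through fal's Python client (one of the hosts for Seedream 4.0). The endpoint id and argument names below are assumptions rather than the exact settings used above, so check the model's page for the real schema:

```python
# Minimal sketch: batch-generating images with Seedream 4.0 text-to-image via
# fal's Python client (`pip install fal-client`). The endpoint id and argument
# names are assumptions -- verify against the model's page on fal before running.
import fal_client

PROMPT = (
    "A partially eaten chicken burrito with a bite taken out, revealing the "
    "fillings inside: shredded cheese, sour cream, guacamole, shredded lettuce, "
    "salsa, and pinto beans all visible in the cross-section of the burrito. "
    "Flour tortilla with grill marks. Taken with a cheap Android phone camera "
    "under harsh cafeteria lighting. Compostable paper plate, plastic fork, "
    "messy table. Casual unedited snapshot, slightly overexposed, flat colors."
)

result = fal_client.subscribe(
    "fal-ai/bytedance/seedream/v4/text-to-image",  # assumed endpoint id
    arguments={
        "prompt": PROMPT,
        "num_images": 4,                   # n=4: sample several times, don't judge on one draw
        "enable_prompt_expansion": True,   # assumed name for the 'standard' expansion setting
    },
)

# Typical fal image responses expose a list of image dicts with a "url" key;
# adjust if this endpoint's schema differs.
for image in result["images"]:
    print(image["url"])
```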
Hunyuan V3 is the only other one that plausibly has a bite taken. The weirdness of the fillings being decoratively sprinkled on top of it does rather count against it, though.
I don't know if it's the abundance of stock photos in the training set or the tuning, but the 'hypertune' default look of AI photos drives me crazy. Things are super smooth, the colors pop wildly, the depth of field is really shallow, everything is overly posed, details are far too sharp, etc. It vaguely reminds me of the skin-crawling filter levels used by people like MrBeast.
I think it is the fine tuning, because you can find AI photos that look more like real ones. I guess people prefer obviously fake looking 'picturesque' photos to more realistic ones? Maybe it's just because the money is in selling to people generating marketing materials? NB is clearly the only model here which permits a half eaten burrito to actually appear to have been bitten.
Someone on reddit made a "real or nano banana pro image" website for people to test if they could spot generated images. The running average was 50% accuracy.
It looks like they took the page down now though...
The NBP one looks like a food mockup to me: the unwrapped burrito on a single piece of intact tinfoil, a table where the grain goes all wonky, an almost pastry-looking tortilla, hyperrealistic beans, and something wrong with the focal plane.
It's just not as plasticky and oversaturated as the others.
The “partially eaten” part of the prompt is interesting…everyone knows what a half-eaten burrito looks like but clearly the computers struggle.
One of my tests for new image generation models is professional food photography, particularly in cases where the food has constraints, such as "a peanut butter and jelly sandwich in the shape of a Rubik’s cube" (blog post from 2022 for DALL-E 2: https://minimaxir.com/2022/07/food-photography-ai/ )
For some reason ever since DALL-E 2, all food models seem to generate obviously fake food and/or misinterpret the fun constraints...until Nano Banana. Now I can generate fractal Sierpiński triangle peanut butter and jelly sandwiches.
I can kind of see what you mean in that it went for realism in the aesthetics, but not the object... but that last one would probably fool me if I was scrolling.
https://mordenstar.com/portfolio/wontauns
An interesting American culinary divide is between Scottsdale and Phoenix homemade burritos. The former is closer to the Midwest variety, the latter to Sonoran style.
Even ignoring the Heinz bean outliers, these are all decidedly Scottsdale. With one exception. All hail Nano Banana.
They all just look like generic Mission burritos to me (leaning towards fast-food menu photos), except some include lettuce and some have blisters, Sonoran style. Only Nano Banana really looks like something I'd get at El Farolito or an LA food truck.
hrm. yea you're right. the page on fal used to produce it was linked with the image, but maybe i made a mistake and sloppily saved the wrong one. i'll have to reroll to check
I don't eat a lot of burritos and when I do they aren't bean burritos, so I'm honestly wondering: do they commonly have whole beans in them? I expect that if they do, they aren't often so clean and shiny looking, but what I expected is more of a mushy/refried bean look.
Do people get burritos with beans in them more or less as pictured? Aesthetically, it seems like it'd look pretty appealing if you were someone who loved beans compared to what I had in mind, but again I'm really in no position to judge these images based on bean appearance.
It's entirely possible to have clean, whole beans in a burrito. It's unusual in commercial kitchens because whole beans are kept warm in the cooking broth until service to avoid drying out. The preparer will scoop them out with a slotted spoon to drain, but usually they're in too much of a hurry to fully drain them. Product photos don't rush this step because a soggy burrito doesn't look good on camera, and they also undercook things so the ingredients don't mush. AI tools have ingested a lot more product photos than real burritos.
With LLMs there is a secondary training step to turn a foundation model into a chatbot. Is something similar going on with these image generation models that makes them all tend towards pretty, clean images and stops them from making half-eaten food, even if they have the capability?
In terms of prompt adherence, there are two issues with most image generation models, neither of which applies to Nano Banana:
1. The text encoders are primitive (e.g. CLIP) and have difficulty with nuance, such as "partially eaten", and model training can only partially overcome it (see the sketch after this list). It's the same issue with the now-obsolete "half-filled" wine glass test.
2. Most models are diffusion-based, which means they denoise the entire image simultaneously. If the model fails to account for the nuance in the first few passes, it can't go back and fix it.
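As a concrete illustration of point 1, here's a minimal sketch using the stock CLIP text encoder from Hugging Face's transformers library (not the encoder of any specific model discussed here): the whole prompt collapses into per-token embeddings plus one pooled vector, and "partially eaten" ends up as only a small nudge to that vector, which the downstream generator is free to underweight.

```python
# Illustrative only: how a CLIP-style text encoder represents a prompt. This uses
# the stock openai/clip-vit-base-patch32 checkpoint via Hugging Face transformers.
import torch
from transformers import CLIPTokenizer, CLIPTextModel

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
encoder = CLIPTextModel.from_pretrained("openai/clip-vit-base-patch32")

prompts = [
    "a burrito on a paper plate",
    "a partially eaten burrito on a paper plate",
]
inputs = tokenizer(prompts, padding=True, return_tensors="pt")

with torch.no_grad():
    out = encoder(**inputs)

# One pooled vector per prompt (512-dim for this checkpoint). "Partially eaten"
# shows up only as a small shift in this vector, which downstream conditioning
# can easily underweight.
pooled = out.pooler_output
similarity = torch.nn.functional.cosine_similarity(pooled[0], pooled[1], dim=0)
print(f"cosine similarity between the two prompts: {similarity.item():.3f}")
```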
I believe some image generation AIs were RLHFed like chatbot LLMs, but more to improve aesthetics than prompt adherence.