Nano-Banana can produce some astonishing results. I maintain a comparison website for state-of-the-art image models with a strong focus on prompt adherence across a wide variety of text-to-image prompts.
I recently finished putting together an Editing Comparison Showdown counterpart where the focus is still adherence, but it tests the ability to make localized edits to existing images using pure text prompts. It currently compares 6 multimodal models including Nano-Banana, Kontext Max, Qwen 20b, etc.
https://genai-showdown.specr.net/image-editing
Gemini Flash 2.5 leads with a score of 7 out of 12, but Kontext comes in at 5 out of 12 which is especially surprising considering you can run the Dev model of it locally.
Don't know if it's the same for others, but my issue with Nano Banana has been the opposite. Ask it to make X significant change, and it spits out what I would've sworn is the same image. Sometimes, randomly and inexplicably, it spits out the expected result.
Anyone else experiencing this or have solutions for avoiding this?
Great comparison! Bookmarked to follow. Keep an eye on Grok; they're improving at a very rapid rate and I suspect they'll be near the top in the not-too-distant future.
Since the page doesn't mention it, this is the Google Gemini Image Generation model: https://ai.google.dev/gemini-api/docs/image-generation
Good collection of examples. Really weird to choose an inappropriate-for-work one as the second example.
This is the first time I really don't understand how people are getting good results. On https://aistudio.google.com with Nano Banana selected (gemini-2.5-flash-image-preview) I get garbage results. I'll upload a character reference photo and a scene and ask Gemini to place the character in the scene. What it then does is simply cut and paste the character into the scene, even if they are completely different in style, colours, etc.
I get far better results using ChatGPT, for example. Of course, the character seldom looks anything like the reference, but it looks better than what I could do in Paint in two minutes.
Am I using the wrong model, somehow??
Through that testing, there is one prompt engineering trend that was consistent but controversial: both a) LLM-style prompt engineering with Markdown-formatted lists and b) old-school AI-image-style quality syntactic sugar such as "award-winning" and "DSLR camera" are extremely effective with Gemini 2.5 Flash Image, due to its text encoder and larger training dataset, which can now more accurately discriminate which specific image traits are present in an award-winning image and which aren't. I've tried generations both with and without those tricks, and the tricks definitely have an impact. Google's developer documentation encourages the latter.
However, taking advantage of the 32k context window (compared to 512 for most other models) can make things interesting. It's possible to render HTML as an image (https://github.com/minimaxir/gemimg/blob/main/docs/notebooks...) and providing highly nuanced JSON can allow for consistent generations. (https://github.com/minimaxir/gemimg/blob/main/docs/notebooks...)
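For the curious, here is a minimal sketch of that structured-prompt style, assuming the google-genai Python SDK and the model name mentioned elsewhere in this thread; the prompt content itself is just an illustration:

```python
from google import genai

client = genai.Client()  # assumes GEMINI_API_KEY is set in the environment

# LLM-style prompt: a Markdown-formatted list plus old-school quality keywords.
prompt = """Generate an image with the following constraints:

- Subject: a corgi wearing a tiny chef's hat
- Setting: a sunlit farmhouse kitchen, morning light
- Style: award-winning photo, DSLR camera, shallow depth of field
"""

response = client.models.generate_content(
    model="gemini-2.5-flash-image-preview",
    contents=prompt,
)

# Generated images come back as inline bytes among the response parts.
for part in response.candidates[0].content.parts:
    if part.inline_data is not None:
        with open("output.png", "wb") as f:
            f.write(part.inline_data.data)
```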
Unfortunately NSFW in parts. It might be insensitive to circulate the top URL in most US tech workplaces. For those venues, maybe you want to pick out isolated examples instead.
(Example: Half of Case 1 is an anime/manga maid-uniform woman lifting up front of skirt, and leaning back, to expose the crotch of underwear. That's the most questionable one I noticed. It's one of the first things a visitor to the top URL sees.)
Personally, I'm underwhelmed by this model. I feel like these examples are cherry-picked. Here are some fails I've had:
- Given a face shot in direct sunlight with severe shadows, it would not remove the shadows
- Given an old black and white photo, it would not render the image in vibrant color as if taken with a modern DSLR camera. It will colorize the photo, but only with washed-out, tinted colors
- When trying to reproduce the 3x3 grid of hairstyles, it repeatedly created a 2x3 grid. Finally, it made a 3x3 grid, but one of the nine models was Black instead of Caucasian.
- It is unable to integrate real images into fabricated imagery. For example, when given an image of a tutu and asked to create an image of a dolphin flying over clouds wearing the tutu, the result looks like a crude Photoshop snip and copy/paste job.
- The second one in case 2 doesn't look anything like the reference map
- The face in case 5 changes completely despite the model being instructed not to do that
- Case 8 ignores the provided pose reference
- Case 9 changes the car positions
- Case 16 labels the tricuspid in the wrong place and I have no idea what a "mittic" is
- Case 27 shows the usual "models can't do text", though I'm not holding that against it too much
- Same with case 29, as well as the readable text not relating to the parts of the image it references
- Case 33 just generated a generic football ground
- Case 37 has nonsensical labellings ("Define Jawline" attached to the eye)
- Case 58 has the usual "models don't understand what a wireframe is", but again I'm not holding that against it too much
Super nice to see how honest they are about the capabilities!
This is amazing. Not that long ago, even getting a model to reliably output the same character multiple times was a real challenge. Now we’re seeing this level of composition and consistency. The pace of progress in generative models is wild.
Huge thanks to the author (and the many contributors) as well for gathering so many examples; it’s incredibly useful to see them to better understand the possibilities of the tool.
I've come to realize that I liked believing that there was something special about the human mental ability to use our mind's eye and visual imagination to picture something, such as how we would look with a different hairstyle. It's uncomfortable seeing that skill reproduced by machinery at the same level as my own imagination, or even better. It makes me feel like my ability to use my imagination is no more remarkable than my ability to hold a coat off the ground like a coat hook would.
As someone who can't visualize things like this in my head, and can only think about them intellectually, your own imagination is still special. When I heard that people can do that, it sounded like a superpower.
AI is like Batman, useless without his money and utility belt. Your own abilities are more like Superman, part of who you are and always with you, ready for use.
But you can find joy in the things you envision, or laugh, or be horrified. The mental ability is surely impressive, but having a reason to do it, and feeling something at the result, is what's special.
"To see a world in a grain of sand
And a heaven in a wild flower..."
We - humans - have reasons to be. We get to look at a sunset and think about the scattering of light and different frequencies and how it causes the different colors. But we can also just enjoy the beauty of it.
For me, every moment is magical when I take the time to let it be so. Heck, for there to even be a me responding to a you and all of the things that had to happen for Hacker News to be here. It's pretty incredible. To me anyway.
The proof of the pudding will be whether machines can develop new art styles. For example, there is a progression in comic/manga/anime art styles over the decades. If humans were to stop that kind of progression (they probably won't), would machines be able to continue it? In principle yes (we are biological machines of sorts), but likely not with the current AI architecture.
Vision has evolved frequently and quickly in the animal kingdom.
Conscious intelligence has not.
As another argument: we've had mathematical descriptions of optics, drawing algorithms, the fixed-function pipeline, ray tracing, and so much more rich math for drawing and animating.
Smart, thinking machines? We haven't the faintest idea.
Progress on Generative Images >> LLMs
Seriously? One could always cut-and-paste (not the computer term) a hairstyle over a photo of a person.
You are now marvelling at someone taking the collective output of humans around the world, then training a model on it with massive, massive compute… and then having a single human compete with that model.
Without the human output on the Internet, none of this would be possible. ImageNet was positively small compared to this.
But yeah, what you call "imagination" is basically perturbations and exploration across a model that you have in your head, which imposes constraints (e.g. gravity) that you learned. Obviously we can remix things now that they're on the Internet.
Having said that, after all that compute, the models had trouble rendering clocks that show an arbitrary time, or a glass of wine filled to the brim.
It does a pretty good job (most of the time) of sticking to the black-and-white coloring-book style while still bringing in enough detail to recognize the original photo in the output.
Man, I hate this. It all looks so good, and it's all so incorrect. Take the heart diagram, for example. Lots of words that sort of sound cardiac but aren't ("ventricar," "mittic"), and some labels that ARE cardiac, but are in the wrong place. The scenes generated from topo maps look convincing, but they don't actually follow the topography correctly. I'm not looking forward to when search and rescue people start using this and plan routes that go off cliffs. Most people I know are too gullible to understand that this is a bullshit generator. This stuff is lethal and I'm very worried it will accelerate the rate at which the populace is getting stupider.
Impressive examples, but with GenAI it always comes down to the fact that you have to cherry-pick the best result after many failed attempts. Right now, it feels like they're pushing the narrative that ExpectedOutput = LLM(Prompt, Input) when it's actually ExpectedOutput = LLM(Prompt, Input) * Takes, where Takes can vary from 1 to 100 or more.
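A minimal sketch of what that "Takes" factor looks like in practice, as a hypothetical best-of-N wrapper; generate and judge here stand in for any image-generation call and any quality check:

```python
from typing import Any, Callable

def generate_with_takes(
    generate: Callable[[str], Any],      # one sampled "take" of the model
    judge: Callable[[Any, str], float],  # scores how well a take matches the prompt
    prompt: str,
    takes: int = 8,
) -> Any:
    """Best-of-N sampling: output quality is a max over attempts, not one call."""
    best, best_score = None, float("-inf")
    for _ in range(takes):
        image = generate(prompt)
        score = judge(image, prompt)
        if score > best_score:
            best, best_score = image, score
    return best
```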
I think it might be the same as with programmers. It might look like AI agents can do all the programming, but when you actually try to use them to do things, it quickly turns out they're not so reliable.
One thing it couldn't do is a transparent background. The model just generates the checkerboard pattern in the background, not real alpha-channel transparency. You can even see artifacts in the pattern.
The training data is presumably full of examples of people using the pattern to indicate transparency (and explaining that they do so, like the input for case 50!), and has far fewer examples of people actually creating such images (if the training data even preserves the alpha channel in the first place).
I think a bigger problem is the "artifacts" you describe (worse than that sounds to me).
Yeah, mangled checkerboard patterns are common when the model is prompted to "remove" the background. It can be worked around by generating multiple images with only the background color varying (e.g. black and white) and reconstructing the alpha channel from their difference, since the model generally prefers to copy and paste when no other prompt overrides that preference.
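A minimal sketch of that difference-matting reconstruction, assuming two renders of the same subject that differ only in background color (file names are hypothetical):

```python
import numpy as np
from PIL import Image

# Two generations of the same subject, one on black and one on white.
black = np.asarray(Image.open("subject_on_black.png").convert("RGB")) / 255.0
white = np.asarray(Image.open("subject_on_white.png").convert("RGB")) / 255.0

# Over black: C_b = a*F. Over white: C_w = a*F + (1 - a).
# Subtracting gives C_w - C_b = 1 - a, so a = 1 - (C_w - C_b).
alpha = np.clip(1.0 - (white - black).mean(axis=2), 0.0, 1.0)

# Un-premultiply to recover the foreground color where alpha is nonzero.
a = np.maximum(alpha[..., None], 1e-3)
foreground = np.clip(black / a, 0.0, 1.0)

rgba = np.dstack([foreground, alpha])
Image.fromarray((rgba * 255).astype(np.uint8), "RGBA").save("subject_transparent.png")
```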
Does anyone else cringe when they see so many examples of sexualised young women? Literally, Case 1/B has a woman lifting up her skirt to reveal her underwear. For an otherwise very impressive model, you are spoiling the PR with this kind of immature content. Sheesh. I guess that confirms it: I am an old grumpy man! I count 26 examples with young women, and 9 examples with men. The only thing missing was "Lena": https://en.wikipedia.org/wiki/Lenna
I had to scroll down way too long for someone to point this out. It's messed up how casually racialised all these image-gen examples are towards young Asian women.
wait until you learn what prehistoric sculptors spent their time carving
VHS, online payments, video streaming... As the old song says, "the internet is porn".
I read your comment before checking the site and then I saw case one was a child followed by a sexy maid and I thought "oh no dear god" before I realized they weren't combining them into a single image.
While I think most of the examples are incredible...
...the technical graphics (especially text) are generally wrong. Case 16 is an annotated heart and the anatomy is nonsensical. Case 28, with the tallest buildings, has decent images but the wrong names, locations, and years.
I'm furnishing a new apartment, and Nano Banana has been super useful for placing furniture I want to purchase in rooms, to judge whether things will work for us or not. Take a picture of the room, feed Nano Banana that picture and the product picture, and ask it to place it in the right location. It can even imagine things at night, or add lamps with the lights on. Super useful!
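That workflow is just a multi-image prompt; a minimal sketch, again assuming the google-genai Python SDK (file names hypothetical):

```python
from google import genai
from PIL import Image

client = genai.Client()  # assumes GEMINI_API_KEY is set in the environment

room = Image.open("living_room.jpg")    # photo of the room
product = Image.open("floor_lamp.jpg")  # product photo from the store

response = client.models.generate_content(
    model="gemini-2.5-flash-image-preview",
    contents=[room, product,
              "Place this floor lamp in the empty corner by the window, "
              "matching the room's lighting and perspective."],
)

# Save the edited room image returned as inline bytes.
for part in response.candidates[0].content.parts:
    if part.inline_data is not None:
        with open("room_with_lamp.png", "wb") as f:
            f.write(part.inline_data.data)
```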