item 34474270

Show HN: New AI edits images based on text instructions

1098 points | bryced | 3 years ago | github.com

This works surprisingly well. Just give it instructions like "make it winter" or "remove the cars" and the photo is altered.

Here are some examples of transformations it can make: Golden gate bridge: https://raw.githubusercontent.com/brycedrennan/imaginAIry/ma... Girl with a pearl earring: https://raw.githubusercontent.com/brycedrennan/imaginAIry/ma...

I integrated this new InstructPix2Pix model into imaginAIry (python library) so it's easy to use for python developers.
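For reference, usage looks like this (the `aimg edit` syntax and `--prompt-strength` flag are taken from examples elsewhere in this thread; exact flags may differ between imaginAIry versions, so check `aimg --help`):

```shell
# Install the library, then edit an image with plain-language instructions.
pip install imaginairy

aimg edit photo.jpg "make it winter"
aimg edit photo.jpg "remove the cars"

# A lower prompt strength keeps the result closer to the original image.
aimg edit photo.jpg "make it pop" --prompt-strength 15
```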

227 comments

[+] sandworm101|3 years ago|reply
Fireworks. These AI tools seem very good at replacing textures, less so at inserting objects. They can all "add fireworks" to a picture. They know what fireworks look like and diligently insert them into the "sky" part of pictures. But they don't know that fireworks are large objects far away rather than small objects up close (see the Father Ted bit on that one). So they add tiny fireworks into pictures that don't have a far-away portion (portraits), or above distant mountain ridges as if they were stars. Also trees. The AI doesn't know how big trees are, so it inserts monster trees under the Golden Gate bridge and tiny bonsais into portraits. Adding objects into complex images is totally hit and miss.
[+] jagaerglad|3 years ago|reply
Another thing was the "Bald" result for the girl with the pearl earring; it seems like the model doesn't know about things like ponytails under headdresses.
[+] ricardobeat|3 years ago|reply
The new models that take depth estimation into account will probably solve this.
[+] taberiand|3 years ago|reply
Perhaps stereoscopic video should be part of the training data?
[+] andrijeski|3 years ago|reply
Given the appropriate training and test set, we can build a model that can overcome these issues, right?
[+] PaulMest|3 years ago|reply
I've played with several of these Stable Diffusion frameworks and followed many tutorials and imaginAIry fit my workflow the best. I actually wrote Bryce a thank you email in December after I made an advent calendar for my wife. Super excited to see continued development here to make this approachable to people who are familiar with Python, but don't want to deal with a lot of the overhead of building and configuring SD pipelines.
[+] bryced|3 years ago|reply
Thanks Paul!
[+] nicbou|3 years ago|reply
Can it make it pop? Because that was the #1 request I remember dealing with.
[+] awestroke|3 years ago|reply
Try these prompts:

"Add lens flare"

"Increase saturation"

"Add sparkles and gleam"

[+] TekMol|3 years ago|reply
#1 request of what, for what, requested by whom?
[+] tamrix|3 years ago|reply
Maybe, but it could put their business logo anywhere!
[+] perfrom1|3 years ago|reply
this should do it:

>> aimg edit input.jpg "make it pop" --prompt-strength 25

[+] bryced|3 years ago|reply
Here is a colab you can try it in. It crashed for me the first time but worked the second time. https://colab.research.google.com/drive/1rOvQNs0Cmn_yU1bKWjC...
[+] iuiz|3 years ago|reply
I could not get the first cell to run.

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts. tensorflow 2.9.2 requires protobuf<3.20,>=3.9.2, but you have protobuf 3.20.3 which is incompatible. tensorboard 2.9.1 requires protobuf<3.20,>=3.9.2, but you have protobuf 3.20.3 which is incompatible.
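The error itself states the version range tensorflow and tensorboard want, so (assuming nothing else in the notebook needs protobuf 3.20+) pinning protobuf back into that range will usually clear it:

```shell
# tensorflow 2.9.2 and tensorboard 2.9.1 both want protobuf <3.20,>=3.9.2,
# but 3.20.3 was installed. Reinstall protobuf inside the accepted range:
pip install "protobuf>=3.9.2,<3.20"

# Then restart the Colab runtime so the downgraded version is picked up.
```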

[+] cbeach|3 years ago|reply
EDIT - it's free of charge: https://research.google.com/colaboratory/faq.html

---

First time I've used "colab" - looks great. Out of interest, who pays for the compute used by this?

Is it freely offered by Google? Or is it charged to my Google API account when I use it? Or your account? It wasn't clear in the UI.

[+] Tenoke|3 years ago|reply
Huh, I'm trying it now and the results seem so weak compared to any other model I've seen since dall-e.
[+] Damirakyan|3 years ago|reply
How would I upgrade to 2.1 if running locally?
[+] Daub|3 years ago|reply
The language of high-level art-direction can be way more complex than one might assume. I wonder how this model might cope with the following:

‘Decrease high-frequency features of background.’

‘Increase intra-contrast of middle ground to foreground.’

‘Increase global saturation contrast.’

‘Increase hue spread of greens.’

[+] CyanBird|3 years ago|reply
They behave quite poorly, because the keywords the models understand are layman language, not technical art-direction or color-correction/color-grading speak.

Hopefully in a couple of years when things have matured more there will be more models capable of handling said requests

The most precise models are actually anime models, because their users have high standards for telling the machine what they expect of it, and the databases are quite well annotated (booru tags).

[+] GordonS|3 years ago|reply
What are the most affordable GPUs that will run this? (It says it needs CUDA with a minimum of 11GB VRAM, so I guess my relatively puny 4GB RX 570 isn't going to cut it!)
[+] b33j0r|3 years ago|reply
Can you imagine only being able to cook a hamburger on one brand of grill? But you can make something kinda similar in the toaster oven you can afford?

I want to be productive on this comment… but the crypto/cuda nexus of GPU work is simply not rational. Why are we still here?

You want to work in this field? Step 1. Buy an NVIDIA gpu. Step 2. CUDA. Step 3. Haha good luck, not available to purchase.

This situation is so crazy. My crappiest computer is way better at AI, just because I did an intel/nvidia build.

I don’t hate NVIDIA for innovating. The stagnation and risk of monopoly setting us back for unnecessary generations makes me a bit miffed.

So. To attempt to be productive here, what am I not seeing?

[+] ColonelPhantom|3 years ago|reply
The cheapest NVidia GPU with 11+GB VRAM is probably the 2060 12GB, although the 3060 12GB would be a better choice.

The setup.py file seems to indicate that PyTorch is used, which I think can also run on AMD GPUs, provided you are on Linux.

[+] smallerfish|3 years ago|reply
It works fine on CPUs. Takes about a minute to generate images on my 8 core i7 desktop.
[+] bryced|3 years ago|reply
I'm running on a 2080 Ti and an edit runs in 2 seconds. On my Apple M1 Max 32GB, edits take about 60 seconds.
[+] kadoban|3 years ago|reply
Whatever 3060 variant that has the most VRAM is probably your best shot these days.
[+] cbeach|3 years ago|reply
On the strength of this HN submission I just ordered an RTX 3060 12GB card for £381 on Amazon so I can run this and future AI models.

This stuff is fascinating, and @bryced's imaginAIry project made it accessible to people like me who never had any formal training in machine learning.

[+] singhrac|3 years ago|reply
For what it's worth, it ran fine on my 2070 (8GB of VRAM), even with the GPU being used to render my desktop (Windows), which used another ~800MB of VRAM. I was running it under WSL, which also worked fine.

Note the level of investment that NVIDIA's software team has here: they have a separate WSL-Ubuntu installation method that takes care not to overwrite Windows drivers but installs the CUDA toolkit anyway. I expected this to be a niche, brittle process, but it was very well supported.

[+] CptanPanic|3 years ago|reply
With a Google Colab free account you get access to a T4 GPU with 15GB VRAM, or Kaggle, which gives you access to 2x T4s or one P100 GPU.
[+] yieldcrv|3 years ago|reply
“Add a dog in my arms”

I’ll keep you posted how well this works for dating apps

[+] sschueller|3 years ago|reply
I am not a fan of software such as this putting in an arbitrary "safety" feature which can only be disabled via an undocumented environment variable. At least make it a documented flag for people who don't have an issue with nudity. There isn't even an indication that there is a "safety" issue; you just get a blank image and are left wondering if your GPU, model, or install is corrupted.

This isn't running on a website that is open to everyone or can be easily run by a novice.

Anyone capable of installing and running this is also able to read code and remove such a feature. There is no reason to hide this nor to not document it.

The amount of nudity you get is also highly dependent on which model you use.

[+] social_quotient|3 years ago|reply
Slightly off topic.

I’ve been looking for an easier way to replace the text in these ai generated images. I found Facebook is working on it with their TextStyleBrush - https://ai.facebook.com/blog/ai-can-now-emulate-text-style-i... but have been unable to find something released or usable yet. Anyone aware of other efforts?

[+] TeMPOraL|3 years ago|reply
> Here are some examples of transformations it can make: Golden gate bridge:

I'm on mobile so can't try this myself now. Can it add a Klingon bird of prey flying under the Golden Gate Bridge, and will "add a Klingon bird of prey flying under the Golden Gate Bridge" prompt/command be enough?

[+] wongarsu|3 years ago|reply
No. At least not with the Stable Diffusion 1.5 checkpoint used in the colab notebook. It seems to only have a very vague idea of what a Klingon bird of prey is. The best I could get in ~30 images was [1], and that's with slight prompt tweaks and a negative prompt to discourage falcons and eagles.

1: https://i.imgur.com/gDj2Kn4.png

[+] anigbrowl|3 years ago|reply
A CUDA-supported graphics card with >= 11GB VRAM (and CUDA installed) or an M1 processor.

/Sighs in Intel iMac

Has anyone managed to get an eGPU running under MacOS? I guess I could use Colab but I like the feeling of running things locally.

[+] perfopt|3 years ago|reply
How does this work? When I run it on a machine with a GPU (pytorch, CUDA etc installed) I still see it downloading files for each prompt. Is the image being generated on the cloud somewhere or on my local machine? Why the downloads?
[+] bryced|3 years ago|reply
Shouldn't be downloads per prompt. Processing happens on your machine. It does download models as needed. A network call per prompt would be a bug.
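The "downloads models as needed" behaviour bryced describes is the standard download-once cache pattern. This is an illustrative sketch of that pattern, not imaginAIry's actual code: a weight file is fetched on first use, stored under a cache directory keyed by its URL, and reused on every later prompt.

```python
# Sketch of a download-once model cache (illustrative, not imaginAIry's code).
import hashlib
import tempfile
from pathlib import Path

def cached_fetch(url: str, fetch, cache_dir: str) -> Path:
    """Return a local path for `url`, calling `fetch(url)` only on a cache miss."""
    cache = Path(cache_dir)
    cache.mkdir(parents=True, exist_ok=True)
    # Key the cache entry on a hash of the URL.
    key = hashlib.sha256(url.encode()).hexdigest()[:16]
    path = cache / key
    if not path.exists():              # first prompt: download the weights
        path.write_bytes(fetch(url))
    return path                        # later prompts: cache hit, no network call

if __name__ == "__main__":
    calls = []
    def fake_fetch(url):               # stand-in for a real HTTP download
        calls.append(url)
        return b"model-weights"
    with tempfile.TemporaryDirectory() as d:
        p1 = cached_fetch("https://example.com/sd-1.5.ckpt", fake_fetch, d)
        p2 = cached_fetch("https://example.com/sd-1.5.ckpt", fake_fetch, d)
        print(len(calls), p1 == p2)    # -> 1 True  (second call never downloads)
```

So if the tool is downloading on every prompt, the cache lookup is being missed somewhere, which matches bryced's "that would be a bug".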
[+] c7b|3 years ago|reply
Many thanks to the OP, can't wait to try this out! I have a question I'm hoping to slide in here: I remember there were also solutions for doing things like "take this character and now make it do various things". Does anyone remember what the general term for that was, and some solutions? (Pretty sure I've seen this on here; apparently I forgot to bookmark it.)

PS: I'm not trying to make a comic book, I'm trying to help a friend solve a far more basic business problem (trying to get clients to pay their bills on time).

[+] bryced|3 years ago|reply
dreambooth perhaps?
[+] zepearl|3 years ago|reply
Thanks a lot!!!

Works perfectly for me (Gentoo Linux + nVidia RTX 3060 12GiB VRAM). I installed your package last week and it just worked; I've been experimenting with it since then and telling parents & colleagues about it.

The results (especially in relation to people's faces) can vary a lot between ok/scary/great (I still have to understand how the options/parameters work). All in all, it's a great package that's easy to handle & use.

In general, if I don't specify an output resolution higher than the default (512x386 or something similar), e.g. with "-w 1024 -h 768", then faces get garbled/deformed like something straight out of a Stephen King novel => is this expected?

Cheers :)
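Garbled small faces at the default resolution are a known Stable Diffusion 1.x limitation (it was trained around 512x512, so a face occupying few pixels gets mangled). Two things worth trying, with the caveat that these flag names are from my recollection of imaginAIry's docs and should be verified against `aimg --help` on your version:

```shell
# Run face restoration on the generated output.
aimg "portrait of a woman on a beach" --fix-faces

# Render at the native resolution, then upscale, instead of
# asking the model to generate directly at 1024x768.
aimg "portrait of a woman on a beach" --upscale
```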

[+] karim79|3 years ago|reply
I've been toying with SD for a while, and I do want to make a nice and clean business out of it. It's more of a side-projecty thing so to speak.

Our "cluster" is running on a ASUS ROG 2080Ti external GPU in the razer core-x housing, and that actually works just fine in my flat.

We went through several iterations of how this could work at scale. The initial premise was basically the google homepage, but for images.

That's when we realised that scaling this to serve the planet was going to be a hell of a lot more work. Not the serving itself so much as conceptualising the concurrent compute requirements, as well as keeping up with the ever-changing landscape and pace of innovation; that part is absolutely necessary.

The quick fix is to use a message queue (we're using Bull) and make everything asynchronous.

So essentially, we solved the scaling factor using just one GPU. You'll get your requested image, but it's in a queue, we'll let you know when it's done. With that compute model in place, we can just add more GPUs, and tickets will take less time to serve if the scale engineering is proper.

I'm no expert on GPU/machine learning/GAN stuff, but Stable Diffusion prompted me to imagine how to build and scale such a service, and I did so. It is not live yet; the name reserved is dreamcreator dot ai, though I can't say when it will go live. Hopefully this year.
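The ticket-queue pattern described above (Bull is a Node.js/Redis job queue) can be sketched with nothing but Python's stdlib. This is a minimal illustration of the idea, not the actual dreamcreator architecture: requests go into a queue, a fixed pool of GPU workers drains it, and callers get a ticket they poll later. Scaling means adding workers, not changing the API.

```python
# Async render queue sketch: one worker thread stands in for one GPU.
import queue
import threading
import uuid

jobs = queue.Queue()
results = {}

def worker():
    """Drain the queue; each job is a (ticket, prompt) pair."""
    while True:
        ticket, prompt = jobs.get()
        # Stand-in for the slow GPU render step.
        results[ticket] = f"image for: {prompt}"
        jobs.task_done()

def submit(prompt: str) -> str:
    """Enqueue a render and return a ticket the client can poll later."""
    ticket = str(uuid.uuid4())
    jobs.put((ticket, prompt))
    return ticket

# One worker = one GPU; add more threads (GPUs) to shorten the queue.
threading.Thread(target=worker, daemon=True).start()

if __name__ == "__main__":
    t = submit("make it pop")
    jobs.join()            # in a real service the client polls instead of blocking
    print(results[t])      # -> image for: make it pop
```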

[+] dandigangi|3 years ago|reply
This is really cool. Haven't seen something like this yet. Going to be very interesting when you start to see E2E generation => animation/video/static => post editing => repeat. Have this feeling that movie studios are going to look into this kind of stuff. We went from real to CGI and this could take it to new levels in cost savings or possibilities.
[+] dandigangi|3 years ago|reply
Played around for a bit. Definitely a cool tool. Wish I had an M1 though. Taking me quite a bit to generate and fans running at full blast. Haha
[+] sebastiennight|3 years ago|reply
It's very interesting, thanks! I've noticed (on the Spock example) that "make him smile" didn't produce a very... "comely" result (he basically becomes a vampire).

I was thinking of deploying something like that in one of our app features, but I'm scared of making our Users look like vampires :-)

Is it your experience that the model struggles more with faces than with other changes?

[+] bryced|3 years ago|reply
Yes if you're not careful it can ruin the face. You can play with the strength factor to see if something can be worked out. Bigger faces are safer.