How to Build Your Own AI-Generated Images with ControlNet and Stable Diffusion

ControlNet model specifically the scribble ControlNet (and ComfyUI) was major gamechanger for me.

Was getting good results with just SD and occassional masking but it would take hours and hours to hone in and composite a complex scene with specific requirements & shapes (with most of the work spent currating the best outputs and then blending them into a scene with Gimp/Inkscape).

Masking is unintuitive compared to the scribble which gets similar effect; no need to paint masks (which is disruptive to the natural process of 'drawing' IMO) instead just make a general black and white outline of your scene. Simply dial up/down the conditioning strength to have it more tightly or fuzzily follow that outline.

You can also use Gimp's Threshold or Inkscape Trace Bitmap tool to get a decent black & white outline from an existing bitmap to expedite the scribble procedure.

fsloth|2 years ago

Comfy ui is really nice. The fact that the node graph is saved as png metadata actually makes node based workflows super fluent and reproducible since all you need to do to get the graph for your image is to drag and drop the result png to the gui. This feels like a huge quality-of-life improvement compared to any other lightweight node tools I’ve used.

l33tman|2 years ago

You don't need to go through Gimp or Inkscape, this is built-in to the auto1111 ControlNet UI. You just dump the existing photo there and you can select a bunch of pre-processors like edge-detection or 3D depth extraction, which is then fed into ControlNet to generate a new image.

This is super powerful for example visualizing the renovation of an apartment room or house exterior.

gkeechin|2 years ago

That's for sure - I think we have seen other kind of edge detector or filter work better for differing use cases, especially around foreground images where you want to retain more information (i.e. images with small nitty-gritty details)

In this post, we just seek to showcase the fastest way to do it - and how augmentation may potentially help vary the position!

moelf|2 years ago

any tutorial you would recommend? I found https://comfyanonymous.github.io/ComfyUI_examples/controlnet...

shostack|2 years ago

Is there any solution for consistency yet that goes beyond form and structure and gets things like outfits, color, and facial features consistent in an easy way to compose scenes with multiple consistent characters?

rvion|2 years ago

I'm building CushyStudio https://github.com/rvion/cushystudio#readme to make Stable Diffusion practical and fun to play with.

It's still a bit rough around the corners, and I haven't properly launched it yet, but if you want to play with ControlNets, pre-processors, IP adapters, and all those various SD technologies, it's a pretty fun tool ! I personally use for real-time scribble to image, things like this :)

(will post that properly on HN in a few days / week I think, when early feedback will have been properly addressed)

bavell|2 years ago

Looking forward to your launch, I found cushystudio awhile back (maybe from HN?) and cannibalized some of the type generation code to make my own API wrapper for personal uses. Thanks!

I barely got it working in that early alpha but it was super helpful for me as a reference. I'll give it another go now that it's further along, it seemed very promising and I liked your workflow approach

imranhou|2 years ago

The versatility of Stable Diffusion, especially when combined with tools like ControlNet, highlights the advantages of a more controlled image generation process. While DALL-E and others provide ease and speed, the depth of customization and local processing capabilities of SD models cater to those seeking deeper creative control and independence.

ChatGTP|2 years ago

It is interesting isn't it? Because we have "AI" generating the image, but we still seem to want to "paint" or have control over the creative process.

Prompts seem to be a new type of camera, lens or paintbrush.

Magi604|2 years ago

SD outputs have an "uncanny valley" type of quality to them. You just KNOW when an image is from SD. And I have looked at getting started with SD, but the requirements and setup and +-prompting "language" just kind of turned me off the whole thing.

Whereas with DALL E you can get some hyper-realistic images from it with very little effort using plain human language.

I guess my point is to ask whether SD is worth bothering with at this time when DALL E and Imagen and possibly others are just on the brink of becoming mainstream and just going to get better and better. Clunking together something with SD seems unnecessary when you can generate more results, better results, in a faster way, with less requirements, and without the steep learning curve, by using other methods.

jyap|2 years ago

One major benefit and the reason why I use the StableDiffusion tools and models is because I can run them at home on my relatively old NVIDIA 2080 GPU with 8GB of VRAM. Costs me nothing (besides electricity).

Depends if you value this kind of freedom in life.

You can do some things such as colorizing black and white images with the Recolor model.

https://huggingface.co/stabilityai/control-lora

DrSiemer|2 years ago

You just KNOW when an image is from SD

No, you know when a beginner generated an image in Stable Diffusion. With enough skill and attention, you will not.

Sure, there is a learning curve and it takes more time to get to a good result. But in turn, it gives you control far beyond what the competition can offer.

smcleod|2 years ago

I’m assuming you haven’t used SDXL?

Give it a go with invokeAI - you can create images that I guarantee you wouldn’t know were generated. Like anything (photography included) it’s a skill.

Examples:

  - https://civitai.com/images/2862100
  - https://civitai.com/images/2339666
  - https://civitai.com/images/2846876

vinckr|2 years ago

I can run Stable Diffusion on my local machine. It is open source and weights are public, giving me in theory access to anything I want to modify.

I cant change anything on DALL E, I can just take the input or change the prompt.

Also it is a centralized service that can be shut down, modified, censored or become very expensive at any time.

NBJack|2 years ago

Try SDXL. Find a good negative prompt, then just put a short sentence (starting with the kind of image, such as photograph, render, etc.) describing what you want in the positive prompt. It is much simpler and has fantastic results. Tweak to your hearts desire from there.

If you see a part of the scene that looks weird (and you know what it should be) add it to your prompt. For example, if you want "photo of a jungle in South America", and the foliage looks weird, add something like "with lush trees and ferns".

brucethemoose2|2 years ago

Try: https://github.com/lllyasviel/Fooocus

I also recommend a good photorealistic base model, like RealVis XL.

In my experience its like DALL E but straight up better, more customizable, and local. And thats before you start trying finetunes and LORAs.

Other UIs will do SDXL, but every one I tried is terrible without all those default fooocus augmentations.

raincole|2 years ago

> You just KNOW when an image is from SD.

You don't. People think they do, but they don't.

fassssst|2 years ago

DALL-E within ChatGPT uses GPT-4 to rewrite what you ask for into a good text-to-image prompt. You could probably do something similar with Stable Diffusion with just a little upfront effort tuning that system prompt.

fsloth|2 years ago

Depends what you want.

Dalle 3 is super good, but lacks the creative control controlnets and ip-adapter provide. So for instance afaik there is no way to perform style transfers, or ’paint a van gogh portrait over my pencil sketch’.

Both are good currently but at different things.

”Prompt engineering” is and will be total bs. Dalle3/chatgpt provides the actual workflow we want where we describe to the intelligent agent (chatpgt) what we want and it worries over the accidental-complexity-intricasies of the clip model itself.

zirgs|2 years ago

Dall-E has the same problems as other models. Try generating a clockwork mechanism with it, for example.

SD is worth bothering with because it's open, you can run and extend it yourself.

vidarh|2 years ago

You know when they're bad enough that you know.

orbital-decay|2 years ago

That's funny to hear because DALL-E 3 mainly improves prompt understanding, it hallucinates like mad with faces and hands, and doesn't seem to do anything to improve them like Midjourney for example.

>Whereas with DALL E you can get some hyper-realistic images from it with very little effort using plain human language.

Hyper-realistic, but is it what you want from it? Are you able to guide it into doing exactly what you want? If you have such requirements that just a natural language prompt is enough and is somehow faster than sketching and providing references, of course use it. I'm not so lucky, I don't get what I want from it, and no amount of prompt understanding will make it easier. Although SD/SDXL doesn't pass the quality bar either, not because it's not "detailed" or "hyper-realistic" enough, but because it doesn't pay attention to the things that should be prioritized, like linework or lighting. Neither does any other model. Controlnets and LoRAs alone aren't sufficient for controllability either, mostly because it's too small to understand high-level concepts. So I don't use anything.

ChildOfChaos|2 years ago

Love all this AI stuff, would love to play more with it, but sadly I'm on a 2015 iMac, great for everything else I do but can't do this stuff.

It's pricey to get a windows machine + GPU and the cloud options seem a bit more limited and add up quickly too, but it is amazing tech.

newswasboring|2 years ago

I have done a bunch of stable diffusion stuff on colab. The free version works if you are lucky enough to get a GPU. Used to happen more often before. But the premium colab isn't badly priced either.

Here is a colab link to open comfyUI

https://github.com/FurkanGozukara/Stable-Diffusion/blob/main...

imranq|2 years ago

While SD is pretty interesting, I'm curious what do people use it for? Outside of custom greeting cards and backgrounds, it's not really precise enough for conceptual art nor is it consistent enough for animation.

Filligree|2 years ago

Illustrating my fanfiction.

tayo42|2 years ago

With the luggage example it seems to only generate backgrounds where the lighting makes sense? That's kind of interesting. I was wondering how it would handle the highlight on the right.

minimaxir|2 years ago

Giving Stable Diffusion constraints forces it to get creative.

It’s the best argument against “AI generated images are just collages”.

zamalek|2 years ago

In ComfyUI you could run the image through a style-to-style (sdxl refinement might even pull it off) model to change the lighting without changing the content. Or use another ControlNet. Your workflow can get arbitrarily complex.

Alifatisk|2 years ago

If I have a large dataset or photos with my face? Can I generate my own images in different places and environments using this?

gbrits|2 years ago

Yep. Lora's are the easiest way to go. Loads of tutorials on Youtube. This is a good one: https://www.youtube.com/watch?v=70H03cv57-o

jaggs|2 years ago

Artroom.ai is a great option. Free image generation feature, plus a ton of editing features like layers, zoom out etc.

ComputerGuru|2 years ago

The word ControlNet doesn’t appear even once in the article?

DarthNebo|2 years ago

They did use the Canny ControlNet Pipeline

telegpt|2 years ago

[deleted]

68 comments