The ControlNet model, specifically the scribble ControlNet (together with ComfyUI), was a major game changer for me.
I was getting good results with just SD and occasional masking, but it would take hours and hours to hone in on and composite a complex scene with specific requirements and shapes (with most of the work spent curating the best outputs and then blending them into a scene with Gimp/Inkscape).
Masking is unintuitive compared to the scribble, which gets a similar effect: no need to paint masks (which is disruptive to the natural process of 'drawing', IMO); instead you just make a rough black-and-white outline of your scene. Simply dial the conditioning strength up or down to have it follow that outline more tightly or more loosely.
You can also use Gimp's Threshold or Inkscape's Trace Bitmap tool to get a decent black-and-white outline from an existing bitmap to expedite the scribble process.
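If anyone wants to try the same scribble workflow outside a UI, here is a minimal sketch using the diffusers library; the model names, filenames, and conditioning scale are just illustrative starting points, not anything taken from the comment above.

    import torch
    from PIL import Image
    from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

    # Scribble ControlNet attached to an SD 1.5 base model.
    controlnet = ControlNetModel.from_pretrained(
        "lllyasviel/sd-controlnet-scribble", torch_dtype=torch.float16
    )
    pipe = StableDiffusionControlNetPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5",
        controlnet=controlnet,
        torch_dtype=torch.float16,
    ).to("cuda")

    # A rough black-and-white outline of the scene (e.g. exported from Gimp/Inkscape).
    scribble = Image.open("scene_outline.png").convert("RGB")

    # controlnet_conditioning_scale is the "conditioning strength" dial:
    # higher values follow the outline tightly, lower values only loosely.
    result = pipe(
        "a cluttered workshop interior, warm evening light, detailed illustration",
        image=scribble,
        controlnet_conditioning_scale=0.8,
        num_inference_steps=30,
    ).images[0]
    result.save("scene.png")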
ComfyUI is really nice. The fact that the node graph is saved as PNG metadata makes node-based workflows super fluent and reproducible, since all you need to do to recover the graph for an image is drag and drop the resulting PNG onto the GUI. This feels like a huge quality-of-life improvement compared to any other lightweight node tools I've used.
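For anyone wondering how the drag-and-drop trick can work: ComfyUI embeds the graph as JSON in the PNG's text chunks, so a few lines of Python can pull it back out. A small sketch (the "workflow"/"prompt" keys follow ComfyUI's convention; the filename is made up):

    import json
    from PIL import Image

    img = Image.open("comfyui_output.png")

    # ComfyUI stores the editable node graph under "workflow" and the executed
    # graph under "prompt", both as JSON strings in the PNG metadata.
    workflow_json = img.info.get("workflow")
    prompt_json = img.info.get("prompt")

    if workflow_json:
        workflow = json.loads(workflow_json)
        print(f"embedded graph has {len(workflow.get('nodes', []))} nodes")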
You don't need to go through Gimp or Inkscape; this is built into the Auto1111 ControlNet UI. You just drop the existing photo there and pick from a bunch of preprocessors like edge detection or 3D depth extraction, whose output is then fed into ControlNet to generate a new image.
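You can also do that preprocessing step yourself outside any UI. The sketch below derives an edge-map control image from a photo with OpenCV's Canny detector (thresholds and filenames are placeholders), which could then be fed to a Canny ControlNet the same way as the scribble example above.

    import cv2
    import numpy as np
    from PIL import Image

    # Load the existing photo and compute an edge map.
    photo = cv2.imread("apartment_room.jpg")
    gray = cv2.cvtColor(photo, cv2.COLOR_BGR2GRAY)
    edges = cv2.Canny(gray, 100, 200)  # thresholds are scene-dependent

    # ControlNet expects a 3-channel control image.
    control = np.stack([edges] * 3, axis=-1)
    Image.fromarray(control).save("control_edges.png")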
This is super powerful, for example for visualizing the renovation of an apartment room or a house exterior.
That's for sure - we have seen other kinds of edge detectors or filters work better for different use cases, especially for foreground images where you want to retain more information (i.e. images with small, nitty-gritty details).
In this post we just seek to showcase the fastest way to do it - and how augmentation may help vary the position!
Is there any solution for consistency yet that goes beyond form and structure and keeps things like outfits, colors, and facial features consistent, so that it's easy to compose scenes with multiple consistent characters?
It's still a bit rough around the edges, and I haven't properly launched it yet, but if you want to play with ControlNets, preprocessors, IP adapters, and all those various SD technologies, it's a pretty fun tool! I personally use it for real-time scribble-to-image, things like this :)
(I'll post it properly on HN in a few days or weeks, I think, once early feedback has been properly addressed.)
Looking forward to your launch. I found CushyStudio a while back (maybe from HN?) and cannibalized some of the type-generation code to make my own API wrapper for personal use. Thanks!
I barely got it working in that early alpha, but it was super helpful for me as a reference. I'll give it another go now that it's further along; it seemed very promising and I liked your workflow approach.
The versatility of Stable Diffusion, especially when combined with tools like ControlNet, highlights the advantages of a more controlled image generation process. While DALL-E and others provide ease and speed, the depth of customization and local processing capabilities of SD models cater to those seeking deeper creative control and independence.
SD outputs have an "uncanny valley" quality to them. You just KNOW when an image is from SD. And I have looked at getting started with SD, but the requirements, the setup, and the +/- prompting "language" just kind of turned me off the whole thing.
Whereas with DALL-E you can get some hyper-realistic images with very little effort using plain human language.
I guess my point is to ask whether SD is worth bothering with at this time, when DALL-E, Imagen, and possibly others are on the brink of becoming mainstream and are only going to get better. Clunking together something with SD seems unnecessary when you can generate more results, better results, faster, with fewer requirements, and without the steep learning curve, by using other methods.
One major benefit, and the reason I use the Stable Diffusion tools and models, is that I can run them at home on my relatively old NVIDIA 2080 GPU with 8 GB of VRAM. It costs me nothing (besides electricity).
Depends on whether you value this kind of freedom in life.
You can do things such as colorizing black-and-white images with the Recolor model: https://huggingface.co/stabilityai/control-lora
No, you know when a beginner generated an image in Stable Diffusion. With enough skill and attention, you will not.
Sure, there is a learning curve and it takes more time to get to a good result. But in turn, it gives you control far beyond what the competition can offer.
Give it a go with InvokeAI - you can create images that I guarantee you wouldn't know were generated. Like anything (photography included), it's a skill.
Try SDXL. Find a good negative prompt, then just put a short sentence (starting with the kind of image, such as photograph, render, etc.) describing what you want in the positive prompt. It is much simpler and has fantastic results. Tweak to your heart's desire from there.
If you see a part of the scene that looks weird (and you know what it should be) add it to your prompt. For example, if you want "photo of a jungle in South America", and the foliage looks weird, add something like "with lush trees and ferns".
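As a reference point, here is a minimal sketch of that SDXL workflow with the diffusers library; the negative prompt is just one example of a starting point, not a recommendation from the comments above.

    import torch
    from diffusers import StableDiffusionXLPipeline

    pipe = StableDiffusionXLPipeline.from_pretrained(
        "stabilityai/stable-diffusion-xl-base-1.0",
        torch_dtype=torch.float16,
        variant="fp16",
    ).to("cuda")

    # Start the positive prompt with the kind of image, then describe the scene;
    # grow it as you spot parts that look wrong.
    prompt = "photograph of a jungle in South America, with lush trees and ferns"
    negative = "blurry, low quality, deformed, watermark, text"

    image = pipe(prompt, negative_prompt=negative, num_inference_steps=30).images[0]
    image.save("jungle.png")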
DALL-E within ChatGPT uses GPT-4 to rewrite what you ask for into a good text-to-image prompt. You could probably do something similar with Stable Diffusion with just a little upfront effort tuning that system prompt.
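A rough sketch of that idea with the OpenAI Python client follows; the system prompt and model name are placeholders I made up, not whatever ChatGPT actually uses internally.

    from openai import OpenAI

    client = OpenAI()  # expects OPENAI_API_KEY in the environment

    SYSTEM = (
        "Rewrite the user's request as a Stable Diffusion prompt: a comma-separated "
        "list of subject, style, lighting and quality tags. Reply with the prompt only."
    )

    def to_sd_prompt(request: str) -> str:
        resp = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {"role": "system", "content": SYSTEM},
                {"role": "user", "content": request},
            ],
        )
        return resp.choices[0].message.content.strip()

    print(to_sd_prompt("a cozy cabin in the woods during a snowstorm"))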
DALL-E 3 is super good, but it lacks the creative control that ControlNets and IP-Adapter provide. For instance, AFAIK there is no way to perform style transfers, or to 'paint a van Gogh portrait over my pencil sketch'.
Both are currently good, but at different things.
"Prompt engineering" is and will remain total BS. DALL-E 3/ChatGPT provides the actual workflow we want, where we describe what we want to the intelligent agent (ChatGPT) and it worries about the accidental-complexity intricacies of the CLIP model itself.
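For a concrete picture of the kind of control meant above, here is a hedged sketch that combines a scribble ControlNet (to keep the composition of the pencil sketch) with IP-Adapter (to pull the style from a reference painting) via diffusers; the model names, adapter scale, and filenames are illustrative only.

    import torch
    from PIL import Image
    from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

    controlnet = ControlNetModel.from_pretrained(
        "lllyasviel/sd-controlnet-scribble", torch_dtype=torch.float16
    )
    pipe = StableDiffusionControlNetPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5",
        controlnet=controlnet,
        torch_dtype=torch.float16,
    ).to("cuda")

    # IP-Adapter supplies the style reference.
    pipe.load_ip_adapter(
        "h94/IP-Adapter", subfolder="models", weight_name="ip-adapter_sd15.bin"
    )
    pipe.set_ip_adapter_scale(0.7)  # how strongly the style reference is applied

    sketch = Image.open("pencil_portrait.png").convert("RGB")     # composition
    style = Image.open("van_gogh_reference.jpg").convert("RGB")   # style

    image = pipe(
        "portrait painting",
        image=sketch,
        ip_adapter_image=style,
        num_inference_steps=30,
    ).images[0]
    image.save("van_gogh_portrait.png")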
That's funny to hear, because DALL-E 3 mainly improves prompt understanding; it hallucinates like mad with faces and hands and doesn't seem to do anything to improve them, unlike Midjourney, for example.
>Whereas with DALL-E you can get some hyper-realistic images with very little effort using plain human language.
Hyper-realistic, but is it what you want from it? Are you able to guide it into doing exactly what you want? If your requirements are such that a natural-language prompt alone is enough, and is somehow faster than sketching and providing references, then of course use it. I'm not so lucky: I don't get what I want from it, and no amount of prompt understanding will make it easier. SD/SDXL doesn't pass the quality bar either, not because it's not "detailed" or "hyper-realistic" enough, but because it doesn't pay attention to the things that should be prioritized, like linework or lighting. Neither does any other model. ControlNets and LoRAs alone aren't sufficient for controllability either, mostly because the base model is too small to understand high-level concepts. So I don't use anything.
I have done a bunch of Stable Diffusion stuff on Colab. The free tier works if you are lucky enough to get a GPU; that used to happen more often than it does now. But premium Colab isn't badly priced either.
While SD is pretty interesting, I'm curious what people use it for. Outside of custom greeting cards and backgrounds, it's not really precise enough for conceptual art, nor is it consistent enough for animation.
With the luggage example it seems to only generate backgrounds where the lighting makes sense? That's kind of interesting. I was wondering how it would handle the highlight on the right.
In ComfyUI you could run the image through a style-to-style model (SDXL refinement might even pull it off) to change the lighting without changing the content, or use another ControlNet. Your workflow can get arbitrarily complex.
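Outside ComfyUI, one rough approximation of that "change the lighting, keep the content" pass is a low-strength img2img step with the SDXL refiner; a sketch, where the strength value, prompt, and filenames are things to tune rather than fixed recommendations.

    import torch
    from PIL import Image
    from diffusers import StableDiffusionXLImg2ImgPipeline

    pipe = StableDiffusionXLImg2ImgPipeline.from_pretrained(
        "stabilityai/stable-diffusion-xl-refiner-1.0",
        torch_dtype=torch.float16,
        variant="fp16",
    ).to("cuda")

    source = Image.open("luggage_composite.png").convert("RGB")

    # Low strength keeps the composition; the prompt nudges the lighting.
    image = pipe(
        "product photo, soft diffuse studio lighting, no harsh highlights",
        image=source,
        strength=0.25,
    ).images[0]
    image.save("relit.png")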
Prompts seem to be a new type of camera, lens or paintbrush.
I can't change anything in DALL-E; I can just take what it gives me or change the prompt.
Also, it is a centralized service that can be shut down, modified, censored, or become very expensive at any time.
I also recommend a good photorealistic base model, like RealVis XL.
In my experience it's like DALL-E but straight up better, more customizable, and local. And that's before you start trying finetunes and LoRAs.
Other UIs will do SDXL, but every one I tried is terrible without all those default Fooocus augmentations.
You don't. People think they do, but they don't.
SD is worth bothering with because it's open; you can run and extend it yourself.
It's pricey to get a Windows machine + GPU, and the cloud options seem a bit more limited and add up quickly too, but it is amazing tech.
Here is a Colab link to open ComfyUI:
https://github.com/FurkanGozukara/Stable-Diffusion/blob/main...
It's the best argument against "AI-generated images are just collages".