Stable Diffusion 2.0

[+] kmeisthax|3 years ago|reply

Is there a good explanation of how to train this from scratch with a custom dataset[0]?

I've been looking around the documentation on Huggingface, but all I could find was either how to train unconditional U-Nets[1], or how to use the pretrained Stable Diffusion model to process image prompts (which I already know how to do). Writing a training loop for CLIP manually wound up with me banging against all sorts of strange roadblocks and missing bits of documentation, and I still don't have it working. I'm pretty sure I also need some other trainables at some point, too.

[0] Specifically, Wikimedia Commons images in the PD-Art-100 category, because the images will be public domain in the US and the labels CC-BY-SA. This would rule out a lot of the complaints people have about living artists' work getting scraped into the machine; and probably satisfy Debian's ML guidelines.

[1] Which actually does work

[+] TaylorAlexander|3 years ago|reply

Ah I am glad to see someone else talking about using public domain images!

Honestly it baffles me that in all this discussion, I rarely see people discussing how to do this with appropriately licensed images. There are some pretty large datasets out there of public images, and doing so might even help encourage more people to contribute to open datasets.

Also if the big ML companies HAD to use open images, they would be forced to figure out sample efficiency for these models. Which is good for the ML community! They would also be motivated to encourage the creation of larger openly licensed datasets, which would be great. I still think if we got twitter and other social media sites to add image license options, then people who want to contribute to open datasets could do so in an easy and socially contagious way. Maybe this would be a good project for mastodon contributors, since that is something we actually have control over. I'd be happy to license my photography with an open license!

It is really a wonderful idea to try to do this with open data. Maybe it won't work very well with current techniques, but that just becomes an engineering problem worth looking at (sample efficiency).

[+] __rito__|3 years ago|reply

Well, you can learn about generative models from MOOCs like the ones taught at UMich, Universitat Tubingen, or New York University (taught by Yann LeCun), and can gain knowledge there.

You can also watch the fast.ai MOOC titled Deep Learning from Scratch to Stable Diffusion [0].

You can also look at open source implementation of text2image models like Dall-E Mini or the works of lucid rain.

I worked on the Dall-E Mini project, and the technical knowhow that you need isn’t closely taught at MOOCs. You need to know, on top of Deep Learning theory, many tricks, gotchas, workarounds, etc.

You could follow the works of Eluther AI, follow Boris Dayma (project leader of Dall-E Mini) and Horace Ho on twitter. And any such people who have significant experience in practical AI and regularly share their tricks. The PyTorch forums is also a good place.

Learn PyTorch and/or JAX/Flax really well.

[0]: https://www.fast.ai/posts/part2-2022.html

[+] echelon|3 years ago|reply

> train this from scratch

If you're talking about training from scratch and not fine tuning, that won't be cheap or easy to do. You need thousands upon thousands of dollars of GPU compute [1] and a gigantic data set.

I trained something nowhere near the scale of Stable Diffusion on Lambda Labs, and my bill was $14,000.

[1] Assuming you rent GPUs hourly, because buying the hardware outright will be prohibitively expensive.

[+] TeMPOraL|3 years ago|reply

> Specifically, Wikimedia Commons images in the PD-Art-100 category, because the images will be public domain in the US and the labels CC-BY-SA.

Doesn't the "BY" part of the license mean you have to provide attribution along with your models' output[0]? I feel you'll have the equivalent of Github Copilot problem: it might be prohibitive to correctly attribute each output, and listing the entire dataset in attribution section won't fly either. And if you don't attribute, your model is no different than Stable Diffusion, Copilot and other hot models/tools: it's still a massive copyright violation and copyright laundering tool.

----

[0] - https://creativecommons.org/licenses/by-sa/4.0/

[+] ShamelessC|3 years ago|reply

> Writing a training loop for CLIP manually wound up with me banging against all sorts of strange roadblocks and missing bits of documentation, and I still don't have it working.

There is working training code for openCLIP https://github.com/mlfoundations/open_clip

But training multi-modal text-to-image models is still a _very_ new thing, in terms of the software world. Given that, my experience has been that it's never been easier to get to work on this stuff from the software POV. The hardware is the tricky bit (and preventing bandwidth issues on distributed systems).

That isn't to say that there isn't code out there for training. Just that you're going to run into issues and learning how to solve those issues as you encounter them is going to be a highly valuable skill soon.

edit:

I'm seeing in a sibling comment that you're hoping to train your own model from scratch on a single GPU. Currently, at least, scaling laws for transformers [0] mean that the only models that perform much of anything at all need a lot of parameters. The bigger the better - as far as we can tell.

Very simply - researchers start by making a model big enough to fill a single GPU. Then, they replicate the model across hundreds/thousands of GPU's, but feed each on a different set of the data. Model updates are then synchronized, hopefully taking advantage of some sort of pipelining to avoid bottlenecks. This is referred to as data-parallel.

[0] https://www.lesswrong.com/tag/scaling-laws

[+] sabalaba|3 years ago|reply

Here’s a tutorial on how to fine tune stable diffusion form the guy who made text-to-pokemon:

https://lambdalabs.com/blog/how-to-fine-tune-stable-diffusio...

[+] Der_Einzige|3 years ago|reply

A paper was already presented at this workshop at COLING 2022 by Nvidia which already does this

https://arxiv.org/abs/2209.14697

[+] j0hnM1st|3 years ago|reply

It will be worthwhile to use images from commons. I have found that my photography is used in the stable diffusion data set. What was funny is that they have taken the images from other URLs than my flickr account.

[+] unknown|3 years ago|reply

[deleted]

[+] LASR|3 years ago|reply

I am a solo dev working on a creative content creation app to leverage the latest developments in AI.

Demoing even the v1 of stable diffusion to the non-technical general users blows them away completely.

Now that v2 is here, it’s clear we’re not able to keep pace in developing products to take advantage of it.

The general public still is blown away by autosuggest in mobile OS keyboards. Very few really know how far AI tech has evolved.

Huge market opportunity for folks wanting to ride the wave here.

This is exciting for me personally, since I can keep plugging in newer and better versions of these models into my app and it becomes better.

Even some of the tech folks I demo my app to, are simply amazed how I can manage to do this solo.

[+] minimaxir|3 years ago|reply

GitHub Repo: https://github.com/Stability-AI/stablediffusion

HuggingFace Space (currently overloaded unsurprisingly): https://huggingface.co/spaces/stabilityai/stable-diffusion

Doing a 2.0 release on a (US) 2-day holiday weekend is an interesting move.

It seems a tad more difficult to set up the model than the previous version.

[+] in3d|3 years ago|reply

In addition to removing NSFW images from the training set, this 2.0 release apparently also removed commercial artist styles and celebrities [1]. While it should be possible to fine tune this model to create them anyway using DreamBooth or a similar approach, they clearly went for the safe route after taking some heat.

1. https://twitter.com/emostaque/status/1595731407095140352?s=4...

[+] machina_ex_deus|3 years ago|reply

I predicted back when they started backpedaling that there's a chance that sd1.4 or 1.5 will be the best available model to the general public, for a very long duration, because the backlash will force them to self-castrate themselves.

You can see nobody likes this new model in any of the stable diffusion communities. It's a big flop and for a good reason. The reason it was so successful in the first place was because you could combine artist names to get the model to the outcome you want.

I'll again remind anyone who thinks they might want to use this to download a working version of SD now. They might break their own libraries in the future, and getting SD1.4 could be a real hassle in a year or so. Getting the right .ckpt file, which can have pickled python malware, is not so trivial, and this will get worse in time.

It's going to diverge into castrated official model that intentionally breaks the older models and older models from unofficial shady sources that might contain malware.

[+] choxi|3 years ago|reply

Mixing artist names was by far the most effective way to create aesthetically pleasing images, this is a huge change. DreamBooth can only fine-tune on a couple dozen images, and you can't train multiple new concepts in one model, but maybe someone will do a regular fine-tune or train a new model.

[+] astrange|3 years ago|reply

This is extremely misleading and you seem to have confused all the other replies.

StableDiffusion 1.0 used CLIP released by OpenAI. 2.0 uses a CLIP retrained from scratch by Stability.

We don’t know OpenAI’s dataset so don’t know what was in it or how to recreate it. Nothing was “removed”.

[+] CuriouslyC|3 years ago|reply

Removing NSFW content is fine, people who care about that can work around it easily. Removing celebrities and commercial artists was a mistake though and I expect this will need to be really impressive in other ways or people aren't going to bother using it.

[+] arez|3 years ago|reply

does this mean that stuff like artstation and deviantart doesn't work anymore as prompts? That would be a huge change

[+] poisonarena|3 years ago|reply

this is really disappointing

[+] liuliu|3 years ago|reply

Seems the structure of UNet hasn't changed other than the text encoder input (768 to 1024). The biggest change is on the text encoder, switched from ViT-L14 to ViT-H14 and fine-tuned based on https://arxiv.org/pdf/2109.01903.pdf.

Seems the 768-v model, if used properly, can substantially speed-up the generation, but not exactly sure yet. Seems straightforward to switch to 512-base model for my app next week.

[+] nl|3 years ago|reply

Highlights:

768x768 native models (v1.x maxed out at 512x512)

a built-in 4x upscaler: "Combined with our text-to-image models, Stable Diffusion 2.0 can now generate images with resolutions of 2048x2048–or even higher."

Depth-to-Image Diffusion Model: "infers the depth of an input image, and then generates new images using both the text and depth information." Depth-to-Image can offer all sorts of new creative applications, delivering transformations that look radically different from the original but which still preserve the coherence and depth of that image (see the demo gif if you haven't looked)

Better inpainting model

Trained with a stronger NSFW filter on training data.

For me the depth-to-image model is a huge highlight and something I wasn't expecting. The NSFW filter is a nothing (it's trivially easy to fine-tune the model on porn if you want, and porn collections are surprisingly easy to come by...).

The higher resolution features are interesting. HuggingFace has got the 1.x models working for inference in under 1G of VRAM, and if those optimizations can be preserved it opens up a bunch of interesting possibilities.

[+] tluyben2|3 years ago|reply

> it's trivially easy to fine-tune the model on porn if you want, and porn collections are surprisingly easy to come by

Not really surprised they did this, but be sure some communities will have it fine tuned on porn now-ish. So probably they did it for legal reasons in case illegal materials are generated and they are real companies/people with their names on the release?

[+] imran-khan|3 years ago|reply

To put things in perspective, the dataset it's trained on is ~240TB and Stability has over ~4000 Nvidia A100 (which is much faster than a 1080ti). Without those ingredients, you're highly unlikely to get a model that's worth using (it'll produce mostly useless outputs).

That argument also makes little sense when you consider that the model is a couple gigabytes itself, it can't memorize 240TB of data, so it "learned".

But if you want to create custom versions of SD, you can always try out dreambooth: https://github.com/XavierXiao/Dreambooth-Stable-Diffusion, that one is actually feasible without spending millions of dollars on GPUs.

[+] prawn|3 years ago|reply

Hopefully related: If I'm a photographer wanting to improve resolution of my content for printing, what's my current best bet for upscaling?

Is it realistic to make use of this on the command line, feeding it my own images? Or has someone wrapped it in an app or online service?

[+] rogers18445|3 years ago|reply

They apparently tried to combat NSFW generation by filtering the training dataset not to include any.

[+] bloaf|3 years ago|reply

They know they are going to be the next target in the war on general purpose computing. They're trying to stave it off for as long as possible by signalling to the authorities that they are the good guys.

A confrontation is inevitable, though. Right now it costs moderate sums of money to do this level of training. Not always will this be so. If I were an AI-centric organization, I would be racing to position myself as a trustworthy actor in my particular corner of the AI space so that when legislators start asking questions about the explosion of bad actors, I can engage in a little bit of regulatory capture, and have the legislators legislate whatever regulations I've already implemented, to the disadvantage of my competitors.

For people who say "people can make whatever images they like in photoshop," I will remind you of this: https://i.imgur.com/5DJrd.jpg

[+] minimaxir|3 years ago|reply

In practice, it's unclear how well avoiding training on NSFW images will work: the original LAION-400M dataset used for both SD versions did filter out some of the NSFW stuff, and it appears SD 2.0 filters out a bit more. The use of OpenCLIP in SD 2.0 may also prevent some leakage of NSFW textual concepts compared to OpenAI's CLIP.

It will, however, definitely not affect the more-common use case of anime women with very large breasts. And people will be able to finetune SD 2.0 on NSFW images anyways.

[+] satvikpendem|3 years ago|reply

They reportedly did so to stop people from generating CSAM [0].

[0] https://old.reddit.com/r/StableDiffusion/comments/y9ga5s/sta...

[+] argella|3 years ago|reply

Bummer. AI porn is fun.

[+] kristopolous|3 years ago|reply

Did they exclude celebrities, politicians, and religious and political symbols?

Deceitful extremists and vengeful criminals fabricating lies seem to be a far more serious problem than fantasy porno.

[+] cma|3 years ago|reply

(Edit: it may have removed that wording now: https://github.com/Stability-AI/stablediffusion/commit/ca86d... )

They can force model upgrades too:

> The New AI Model Licenses Have a Legal Loophole (OpenRAIL-M of Stable Diffusion)

https://www.youtube.com/watch?v=W5M-dvzpzSQ

[+] myaccount9786|3 years ago|reply

The easiest way to combat this is to put your model behind an API and filter queries (midjourney, OpenAI) or just not make it available (Google). The tradeoff is that you're paying for everyone's compute.

I guess SD is betting on saving $ on compute being more important in this space than the ability to gatekeep certain queries. And the tradeoff is that you need to do nsfw filtering in your released model.

It will be interesting to see who's right in 2 years.

[+] throwup|3 years ago|reply

You can generate all the bloody violent gore you like, but god forbid anybody see a human body in its natural state

[+] tormeh|3 years ago|reply

I can't see any progress on AMD/Intel GPU support :( Would love to see Vulkan or at least ROCm support. With SD1 you could follow some guides online to make it work, since PyTorch itself supports ROCm, but the state of non-Nvidia GPU support in the DL space is quite sad.

[+] charcircuit|3 years ago|reply

I dislike how they call their model open source even though there are restrictions on how you can use the model. The ability to use code however you want and not have to worry about if all the code you are using is compatible with your use case is a key part of open source.

[+] acidburnNSA|3 years ago|reply

Awesome. I'm installing on Ubuntu 22.04 right now.

Ran into a few errors with the default instructions related to CUDA version mismatches with my nvidia driver. Now I'm trying without conda at all. Made a venv. I upgraded to the latest that Ubuntu provides and then downloaded and installed the appropriate CUDA from [1].

That got me farther. Then ran into the fact that the xformers binaries I had in my earlier attempts is now incompatible with my current drivers and CUDA, so rebuiding that one. I'm in the 30-minute compile, but did the `pip install ninja` as recommended by [2] and it's running on a few of my 32 threads now. Ope! Done in 5 mins. Test info from `python -m xformers.info` looks good.

Damn still hitting CUDA out of memory issues. I knew I should have bought a bigger GPU back in 2017. Everyone says I have to downgrade pytorch to 1.12.1 for this to not happen. But oh dang that was compiled with a different cuda, oh groan. Maybe I should get conda to work afterall.

`torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 30.00 MiB (GPU 0; 5.93 GiB total capacity; 5.62 GiB already allocated; 15.44 MiB free; 5.67 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF`

Guess I better go read those docs... to be continued.

[1] https://developer.nvidia.com/cuda-downloads?target_os=Linux&...

[2] https://github.com/facebookresearch/xformers

[+] stavros|3 years ago|reply

It kind of annoys me that they removed NSFW images from the training set. Not because I want to generate porn (though some people do), but because I feel that they're foisting a puritan ethic on me. I don't consider the naked body inherently bad, and I don't like seeing new technology carry this (wrong, in my opinion) stigma.

Then again, it's their model, they can do whatever they want with it, but it still leaves me with a weird feeling.

[+] sorenjan|3 years ago|reply

I've seen references to merging models together to be able to generate new kinds of imagery or styles, how does that work? I think you use Dreambooth to make specialized models, and I think I got an idea about how that basically assigns a name to a vector in the latent space that represents the thing you want to generate new imagery of, but can you generate multiple models and blend them together?

Edit: Looks like AUTOMATIC1111 can merge three checkpoints. I still don't know how it works technically, but I guess that's how it's done?

https://github.com/AUTOMATIC1111/stable-diffusion-webui

[+] 88stacks|3 years ago|reply

Awesome, I’ve put stable diffusion on an api to train a model for anyone to use for free. I’m adding 2.0 to it as we speak! https://88stacks.com

[+] kanonieer|3 years ago|reply

Interesting project but terrible naming.

[+] pilaf|3 years ago|reply

> to sue for free

Typo of freudian slip?

Just kidding of course, nice project!

[+] dimaor|3 years ago|reply

How is this free? Is it possible to download the checkpoints?

I'm asking because I'm running SD locally but my GPU is not good enough to train new checkpoints and while I get the time to work on improve I wanted to use this API in order to generate some models for an illustration book I am working on.

[+] Tepix|3 years ago|reply

You really need a privacy policy!

Ideally one that states that the uploaded images are deleted after generating the model and not used for anything else in any fashion whatsoever.

Also,let people download the models and delete them afterwards with the same handling. Then it gets very interesting indeed!

[+] julienmarie|3 years ago|reply

Interesting concept!

[+] dzink|3 years ago|reply

The upscaler link on this page https://huggingface.co/stabilityai/stable-diffusion-x4-upsca... is broken. Here is a working link https://huggingface.co/stabilityai/stable-diffusion-x4-upsca...

[+] Scandiravian|3 years ago|reply

What's the potential of using this for image restoration? I've been looking into this recently as I've found a ton of old family photos, that I'd like to digitize and repair some of the damage on them

There are a lot of tools available, but I haven't found anything where the result isn't just another kind of bad, so if the upscaling and inference in this model is good, it should in theory be possible to restore images by using the old photos as the seed, right?

[+] serpix|3 years ago|reply

Current Stable Diffusion's (1.4 and 1.5) img2img can accomplish a lot of this.

[+] WatchDog|3 years ago|reply

Are the GPU memory requirements different for this release?

Is now it possible to generate higher resolution images with less memory?

[+] coldblues|3 years ago|reply

I just thought about this, so bare in mind that I don't know much of the technical implications of this, but:

Couldn't we train a very good model by distributing the dataset along with the computing power using something similar to folding@home?

[+] s1k3s|3 years ago|reply

Is there any place where we can learn more about all these AI tools that keep popping up, that is not marketing speak? Also, I see the words 'open' and 'open source' and yet they all require me to sign up to some service, join some beta program, buy credits etc. Are they open source?

493 comments