I think most people are vastly underestimating the impact of Synthetic AI media - this is at least as big as the invention of film/photography (a century-level shift) and maybe as big as the invention of writing (a millennia-level shift). Once you really think through the consequences of the collapse of the gap between idea and execution, you tend to think it's the latter...
Other than "you can't trust anything you don't see with your own eyes", what kind of shift is it? People lived like that for literally millennia before photography, audio, and video recording.
At absolute worst, we are only undoing about 150 years of development, and only "kind of" and only in certain scenarios.
Moreover, people were making convincing edits of street signs, etc. literally 20 years ago using just Photoshop. What does this really change at a fundamental level? Okay, so you can mimic voices and generate video, rather than just static images. But people have been making hoax recordings and videos for longer than we've had computers.
I think the effects of this stuff will be: 1) making it easier/cheaper to create certain forms of art and entertainment media, 2) making it easier to create hoaxes, and 3) we will eventually need to contend with challenges to IP law. That's about it. I think it will create a lot of value for a lot of people (sibling comment makes a good point about this being equivalent to CGI), but I don't see the big societal shift you're claiming that this is.
Take a look at https://www.reddit.com/r/midjourney/ if you want to see what Midjourney is capable of. Some of the results are extremely impressive. [1][2][3][4]
Still, humans use art to communicate intent, and we still consider AIs to be 'things', with no agency or intent. Being an artist just became a lot harder, because no amount of technical prowess can make you stand out. It's all about the narrative now.
I think what we have is a toy and will remain a toy, just like Eliza was 60 years ago. Academically fascinating, and given the constraints of the era, genuinely remarkable, but still a long way from really being useful.
I'm already getting bored of seeing 95%-amazing, 5%-wtf AI-generated images; I can't fathom how anyone else stays excited about this stuff for so long. My Slack is filled with impressive-but-not-quite-right images of all sorts of outrageous scenarios.
But that's the catch. These diffusion models are stuck creating wacky or surreal images, because those contexts are essentially what allows you to ignore how badly these generators miss the mark.
Synthetic AI media won't even be as disruptive as Photoshop, let alone the creation of written language.
This thought occurred to me recently while skimming the formulaic and indistinguishable programming on Netflix. It won't be long before a GPT-3 script is fed to an image generator and out comes the components of a movie or TV show. The product will undoubtedly need some human curation and voice acting, but the possibility of a one-person production studio is on the horizon.
> I've been making a completely Midjourney-generated "interactive" film called SALT
I stumbled over Midjourney the other day through these music videos[1][2], generated by Midjourney from the songs' lyrics, and I immediately thought we're not far away from this being viable for a cartoon-like film.
It’s really crazy how Stable Diffusion seems to be very much on par with DALL-E, and you can run it on “most” hardware. Is there an equivalent for GPT-3? I don’t think I can even run the 2M lite GPT-J on my computer…
7 days, and already that many UIs, plugins and integrations released. To be fair, developers and researchers got access a bit earlier, but that is still impressive adoption speed.
Tangent discussion: what are people's experiences here with running Stable Diffusion locally? I've installed it but haven't had time to play around with it yet, and I have an RTX 3060 8GB GPU -- IIRC, the official SD docs say that 10GB is the minimum, but I've seen posts/articles saying it can be done with 8GB.
Mostly I'm interested in the processing time. Like, using a midrange desktop, what's the average time to expect SD to produce an image from a prompt? Minutes/Tens of minutes/Hours?
Keep the batch size low (probably equal to 1); that was my main issue when I first installed this.
Then, there are already lots of great forks which add an interactive REPL or web UI [0][1]. They also run with half precision, which roughly halves memory use. Additionally, they optionally integrate with upscaling neural networks, which means you can generate 512x512 images with Stable Diffusion and then scale them up to 1024x1024 easily. Moreover, they optionally integrate with face-fixing neural networks, which can also drastically improve the quality of the images.
There's also this ultra-optimized repo, but it's a fair bit slower [2].
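Half precision saves far more than a few bytes: it roughly halves the memory needed for the model weights alone. A back-of-envelope sketch (the ~860M parameter count for the Stable Diffusion v1 UNet is an approximate, commonly cited figure, and the helper is purely illustrative):

```python
def weight_gib(n_params: int, bytes_per_param: int) -> float:
    """Memory for model weights alone, in GiB (activations cost extra)."""
    return n_params * bytes_per_param / 2**30

UNET_PARAMS = 860_000_000  # approximate figure for the SD v1 UNet

print(f"fp32: {weight_gib(UNET_PARAMS, 4):.2f} GiB")  # ~3.20 GiB
print(f"fp16: {weight_gib(UNET_PARAMS, 2):.2f} GiB")  # ~1.60 GiB
```

That ~1.6 GiB saved on weights is a big part of why the half-precision forks squeeze onto 6-8 GB cards.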
ASUS Zephyrus G15 (GA503QM) with a laptop 3060 (95W, I think) with 6GB of VRAM, basujindal fork, does 512×512 at about 3.98 iterations per second in turbo mode (for which there’s plenty of memory at that size). That’s under 15 seconds per image on even small batches at the default 50 steps, and I think it was only using around 4.5GB of VRAM.
(I say “I think” because I’ve uninstalled the nvidia-dkms package again while I’m not using it because having a functional NVIDIA dual-GPU system in Linux is apparently too annoying: Alacritty takes a few seconds to start because it blocks on spinning up the dGPU for a bit for some reason even though it doesn’t use it, wake from sleep takes five or ten seconds instead of under one second, Firefox glyph and icon caches for individual windows occasionally (mostly on wake) get blatted (that’s actually mildly concerning, though so long as the memory corruption is only in GPU memory it’s probably OK), and if the nvidia modules are loaded at boot time Sway requires --unsupported-gpu and my backlight brightness keys break because the device changes in the /sys tree and I end up with an 0644 root:root brightness file instead of the usual 0664 root:video, and I can’t be bothered figuring it out or arranging a setuid wrapper or whatever. Yeah, now I’m remembering why I would have preferred a single-GPU laptop, to say nothing of the added expense of a major component that had gone completely unused until this week. But no one sells what I wanted without a dedicated GPU for some reason.)
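Timings like "under 15 seconds at 50 steps" are easy to sanity-check: wall-clock time per image is roughly sampling steps divided by iteration rate. A quick sketch (the fixed per-image overhead constant is my assumption, not a measured value):

```python
def estimated_seconds(steps: int, it_per_s: float, overhead_s: float = 2.0) -> float:
    """Rough wall-clock time for one image: sampling steps / iteration rate,
    plus a fixed fudge factor for VAE decode and other per-image overhead."""
    return steps / it_per_s + overhead_s

# The default 50 steps at the ~3.98 it/s quoted above:
print(round(estimated_seconds(50, 3.98), 1))  # ~14.6, i.e. "under 15 seconds"
```

So for a midrange card the answer to "minutes, tens of minutes, or hours?" is: seconds.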
Removing the NSFW and watermark modules from the model will easily allow you to run it with 8 GB VRAM (usually takes around 6.9 GB for 512x512 generations).
With an RTX 3060, your average image generation time is going to be around 7-11 seconds if I recall correctly. This swings wildly based on how you adjust different settings, but I doubt you'll ever require more than 70 seconds to generate an image.
I'm using the fork at https://github.com/basujindal/stable-diffusion which is optimized for lower VRAM usage. My RTX 2070 (8 GB) takes about 90 seconds to generate a batch of 4 images.
It's pretty fast on an RTX 3070 (8GB): a few seconds per image.
My first impression is that it seems a lot more useful than DALL-E, because you can quickly iterate on prompts and generate many batches, picking the best ones. To get something that's actually usable, you'll have to tinker around a bit and give it a few tries. With DALL-E, feedback is slower, and there's a reluctance to just hammer prompts because of the credits.
With this, it takes 30s to generate a 768x512 image (any larger and I am experiencing OOM issues again). I think you should expect a bit faster at the same resolution with your 3060 because it's a faster card with the same amount of memory.
I have a Titan X (Pascal) from around 2015 with 12GB of VRAM and I've had no trouble running it locally. It takes me about 30 seconds to generate a single image at 30 DDIM steps (which is the bare minimum I'd use for quick iterations). When I want higher-quality images after settling on a proper prompt, I set it to 100 or 200 DDIM steps, and that takes maybe a minute per picture (I didn't measure accurately). I usually just let it run in batches of 10 or 20 pictures while I go do something else, then come back 15-20 minutes later.
It runs pretty well. The most I can get is a 768x512 image, but that's pretty good for stuff like visual novel background art[0] and similar things.
AI generated art is interesting and will probably be helpful.
I see it as a cheap and fast alternative to paying a concept artist.
But not a revolution.
Creating precise and coherent assets is going to be a challenge, at least with the current architecture.
From a research perspective this is, I think, much more than a toy: these models can help us better understand the nature of our own minds, especially how they process text, images and abstraction.
This is what a revolution looks like when it is happening.
Did you learn about the "Industrial revolution" or the "agricultural revolution" in class? That didn't take a week, or a year, or a decade to happen. Even the Internet revolution took more than a decade.
This is a revolution. And you're seeing it happen in real time.
I think what it shows us is that activities we think of as "human", like getting drunk, saying silly things that sound brilliant, or painting things that look stunning, are actually the things a machine has the least trouble copying.
Whereas things we associate more with computers, such as hard thinking, mathematics, etc., turn out to be more difficult for a machine to copy, and are therefore perhaps more "human".
I'd dismissed DALL-E: very cool, but it won't really replace anyone. After playing with Stable Diffusion, though, as an artist, this is the most profound experience I've ever had with a computer. Check this out: https://andys.page/posts/how-to-draw/
After playing with it for a few hours, I'm sold on it soon replacing all blog-spam media and potentially flooding Etsy with "artists" trying to pass the renders off as their own artwork.
It must be interesting being a graphic artist in 2020-2022. First NFTs, which enabled some to make millions of dollars. Then, less than two years later, Stable Diffusion, which will probably shrink the market for human graphic artists significantly.
I guess we know where the new market for all those Ethereum miners' GPUs will come from. I have always been sort of bear-ish on the trend towards throwing GPU power at neural nets and their descendants, but clearly there are amazing applications for this tech. I still think it's morally kinda wrong to copy an artist's style using an automated tool like this, but I guess we'll have to deal with that because there's no putting this genie back in the bottle.
Just imagine: you could write your own script for a series and have it realistically generated, especially cartoons, complete with voice acting. Popular generated Spongebob episodes could become canonical entries in the mind of the general public; after some informational fallout, the original episodes couldn't even be told apart. Postmodern pastiche will accelerate and become total.
Seven days trying to install Python and its packages, and I failed. I had to remove that garbage (global dependencies) from my machine. Such a wasteful ecosystem.
I'm using the Docker one, so much easier and no worries of polluting my real environment (all the installation scripts tend to download a variety of things from a variety of places).
OpenVINO Stable Diffusion (a CPU port of SD) is an easy install on Linux within a venv. Be sure to update pip first before installing the required packages from the list. The lack of GPU acceleration and its associated baggage makes this much easier to set up and run.
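For reference, that venv setup can be sketched roughly like this (paths are illustrative; follow the repo's README for the actual requirements step):

```shell
# keep the port's Python dependencies out of the global site-packages
python3 -m venv sd-venv
. sd-venv/bin/activate

# update pip first, then install the project's pinned requirements
# (commented out here; run them inside the cloned repo):
# pip install --upgrade pip
# pip install -r requirements.txt
```

Everything stays inside `sd-venv/`, so removing the directory removes the whole install.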
It took me some time to get the OpenVINO port running on my Windows box. It turned out it wasn't compatible with Python 3.10; I had to go back to 3.9. Maybe that'll help you?
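If anyone hits the same wall, a small check can fail fast with a clear message instead of dying deep inside a dependency. The 3.9 ceiling below just encodes what's reported above; adjust it if the port gains newer-Python support:

```python
import sys

def python_ok(version_info=sys.version_info) -> bool:
    """True for CPython 3.x up to 3.9, the range the port reportedly works with."""
    major, minor = version_info[:2]
    return major == 3 and minor <= 9

print(python_ok((3, 9, 13)))   # True
print(python_ok((3, 10, 4)))   # False
```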
fab1an | 3 years ago
What we're seeing now are toy-era seeds for what's possible - e.g. I've been making a completely Midjourney-generated "interactive" film called SALT: https://twitter.com/SALT_VERSE/status/1536799731774537733
That would have been completely impossible just a few months ago. Incredibly exciting to think what we'll be able to do just one year from now..
kasperni | 3 years ago
[1] https://www.reddit.com/r/midjourney/comments/x0kv8s/testp_ju...
[2] https://www.reddit.com/r/midjourney/comments/wz1am0/homer_si...
[3] https://www.reddit.com/r/midjourney/comments/x10som/the_amou...
[4] https://www.reddit.com/r/midjourney/comments/x12nqz/robert_d...
paisawalla | 3 years ago
"This episode of Law & Order, but if Jerry Orbach never left the show"
"Final Fantasy VII as an FPS taking place in the Call of Duty universe"
"A 3D printable part that will enable automatic firing mode for {a given firearm}"
magicalhippo | 3 years ago
Interesting times ahead.
[1]: https://www.youtube.com/watch?v=bulNXhYXgFI
[2]: https://www.youtube.com/watch?v=KVj_AEhpVbA
ManuelKiessling | 3 years ago
You can invite the bot to your server via https://discord.com/api/oauth2/authorize?client_id=101337304...
Talk to it using the /draw Slash Command.
It's very much a quick weekend hack, so no guarantees whatsoever. Not sure how long I can afford the AWS g4dn instance, so get it while it's hot.
Oh and get your prompt ideas from https://lexica.art if you want good results.
PS: Does anyone know where to host reliable NVIDIA-equipped VMs at a reasonable price?
metadat | 3 years ago
https://www.vice.com/en/article/xgygy4/stable-diffusion-stab...
Why'd they "overlook" it? It's probably more culturally significant and controversial than any of the others. It's the natural elephant in the room.
folli | 3 years ago
Some years ago, the pendulum was very much on the other side.
cube2222 | 3 years ago
[0]: https://github.com/lstein/stable-diffusion
[1]: https://github.com/hlky/stable-diffusion
[2]: https://github.com/basujindal/stable-diffusion
mlsu | 3 years ago
I was able to obtain 256x512 images with this card using the standard model, but ran into OOM issues.
I don't mind waiting, so now I am using the "fast" repo:
https://github.com/basujindal/stable-diffusion
Morgawr | 3 years ago
[0] - https://twitter.com/xMorgawr/status/1564271156462440448
motoboi | 3 years ago
This is mind blowing.
m_ke | 3 years ago
Here's some of the stuff I generated: https://imgur.com/a/mfjHNgO
digitallyfree | 3 years ago
https://github.com/bes-dev/stable_diffusion.openvino
coding123 | 3 years ago
Instead of labeling data for what things are, we'll have to label things as being generated or not.
throwaway888abc | 3 years ago
Great write up