(no title)
JonathanFly | 2 years ago
Some notes:
- This uses a new type of diffusion transformer (similar to Sora) combined with flow matching and other improvements (see the sketch after these notes).
- This takes advantage of transformer improvements & can not only scale further but also accept multimodal inputs.
- Will be released openly; the preview is to improve its quality & safety, just like the original Stable Diffusion
- It will launch with a full ecosystem of tools
- It's a new base taking advantage of the latest hardware & comes in all sizes
- Enables video, 3D & more.
- Need moar GPUs.
- More technical details soon
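For readers wondering what "diffusion transformer + flow matching" means in practice, here is a minimal sketch of a rectified-flow-style training step on image latents. The model interface and conditioning names are hypothetical stand-ins for illustration, not SD3's actual code.

```python
# Minimal flow-matching training step (rectified-flow style) on latents.
# `model` is a DiT-like network with a hypothetical (x, t, cond) signature.
import torch

def flow_matching_loss(model, x1, cond):
    """x1: clean latents (B, C, H, W); cond: text conditioning."""
    x0 = torch.randn_like(x1)                      # noise sample
    t = torch.rand(x1.shape[0], device=x1.device)  # uniform timestep in [0, 1]
    t_ = t.view(-1, 1, 1, 1)
    xt = (1 - t_) * x0 + t_ * x1                   # straight-line interpolation
    target_velocity = x1 - x0                      # constant velocity along the path
    pred_velocity = model(xt, t, cond)             # network predicts the velocity field
    return torch.nn.functional.mse_loss(pred_velocity, target_velocity)
```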
>Can we create videos similar to Sora?
Given enough GPUs and good data, yes.
>How does it perform on a 3090, 4090 or less? Are we mere mortals gonna be able to have fun with it?
It's in sizes from 800M to 8B parameters now; there will be all sizes, for everything from edge devices to giant GPU deployments.
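As a rough sense of what those sizes mean for consumer cards, here is a back-of-the-envelope weight-memory estimate; the bytes-per-parameter figures are the standard ones, the 2B middle size is just an illustrative point, and activations/optimizer state are ignored.

```python
# Rough weight-memory estimate for the quoted 800M-8B size range.
def weight_gb(params, bytes_per_param):
    return params * bytes_per_param / 1e9

for params in (0.8e9, 2e9, 8e9):           # 800M ... 8B parameters
    for dtype, nbytes in (("fp16", 2), ("int8", 1)):
        print(f"{params/1e9:.1f}B @ {dtype}: ~{weight_gb(params, nbytes):.1f} GB")

# An 8B model needs roughly 16 GB of weights in fp16, so it sits near the top of a
# 24 GB consumer card once activations are counted, while 800M fits almost anywhere.
```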
(adding some later replies)
>awesome. I assume these aren't heavily cherry picked seeds?
No, this is all one generation. With DPO, refinement, and further improvement it should get better.
>Do you have any solves coming for driving coherency and consistency across image generations? For example, putting the same dog in another scene?
Yeah, see @Scenario_gg's great work with IP adapters, for example. Our team builds ComfyUI, so you can expect some really great stuff around this...
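For anyone unfamiliar with IP adapters, here's a hedged sketch of the idea using the diffusers library; the repo name, weight filename, and exact method signatures are from memory and may differ in the current release, so treat this as the shape of the workflow rather than a recipe.

```python
# Subject consistency via an IP adapter: condition generation on a reference image
# in addition to the text prompt (illustrative; check current diffusers docs).
import torch
from diffusers import StableDiffusionPipeline
from diffusers.utils import load_image

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
pipe.load_ip_adapter("h94/IP-Adapter", subfolder="models",
                     weight_name="ip-adapter_sd15.bin")
pipe.set_ip_adapter_scale(0.6)                 # how strongly to follow the reference

dog = load_image("my_dog.png")                 # reference image of the subject
out = pipe(prompt="the same dog on a beach at sunset",
           ip_adapter_image=dog,
           num_inference_steps=30).images[0]
out.save("dog_on_beach.png")
```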
>Dall-e often doesn’t even understand negation, let alone complex spatial relations in combination with color assignments to objects.
I imagine the new version will. DALL-E and MJ are also pipelines; you can pretty much do anything accurately with pipelines now.
>Nice. Is it an open-source / open-parameters / open-data model?
Like prior SD models it will be open source/parameters after the feedback and improvement phase. We are open data for our LMs but not other modalities.
>Cool!!! What do you mean by good data? Can it directly output videos?
If we trained it on video, yes; the architecture is very much like Sora's.
cheald|2 years ago
Very interesting. I've been stretching my 12GB 3060 as far as I can; it's exciting that smaller hardware is still usable even with modern improvements.
memossy|2 years ago
Bigger than that is also possible; it's not saturated yet, but we need more GPUs.
wongarsu|2 years ago
AMD still suffers from limited resources and doesn't seem willing to spend much chasing a market that might just be temporary hype. Google's TPUs are a pain to use and seem to have stalled out. Intel lacks commitment, and even their products that went roughly in that direction aren't a great match for neural networks because of their philosophy of having fewer, more complex cores.
SV_BubbleTime|2 years ago
You'll be able to get higher resolution, but slowly. Or pay the ~$2800 for a 5090 and get high res with good speed.
weebull|2 years ago
GPUs need a decent virtual memory system though. The current "it runs or it crashes" situation isn't good enough.
netdur|2 years ago
Why is there not a greater focus on quantization to optimize model performance, given the evident need for more GPU resources?
memossy|2 years ago
Need moar GPUs to do a video version of this model similar to Sora, now that they have proved that Diffusion Transformers can scale with latent patches (sketched below; see stablevideo.com and our work on that model, currently the best open video model).
We have 1/100th of the resources of OpenAI and 1/1000th of Google etc.
So we focus on great algorithms and community.
But now we need those GPUs.
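A minimal sketch of that latent-patch idea: a VAE latent is cut into non-overlapping patches that become the token sequence a transformer denoises. The shapes here are illustrative, not the actual SD3/Sora/SVD configuration.

```python
# Patchify VAE latents into a token sequence for a DiT-style transformer.
import torch
import torch.nn as nn

class Patchify(nn.Module):
    def __init__(self, in_channels=4, patch_size=2, dim=1024):
        super().__init__()
        # A strided conv both cuts patches and linearly embeds them in one step.
        self.proj = nn.Conv2d(in_channels, dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, latents):                 # (B, 4, H, W) VAE latents
        x = self.proj(latents)                  # (B, dim, H/p, W/p)
        return x.flatten(2).transpose(1, 2)     # (B, num_patches, dim) token sequence

tokens = Patchify()(torch.randn(1, 4, 64, 64))  # 64x64 latent -> 1024 patch tokens
print(tokens.shape)
# For video, the same trick extends to spatio-temporal patches, which is why "more GPUs"
# is the main ingredient: sequence length grows with resolution and duration.
```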
AnthonyMouse|2 years ago
There is an inherent trade-off between model size and quality. Quantization reduces model size at the expense of quality. Sometimes it's a better way to do that than reducing the number of parameters, but it's still fundamentally the same trade-off. You can't make the highest-quality model use the smallest amount of memory. It's information theory, not sorcery.
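A toy illustration of that trade-off, assuming plain symmetric per-tensor int8 quantization (the generic idea, not any specific SD3 deployment path): memory roughly halves relative to fp16, and a small reconstruction error appears.

```python
# Symmetric int8 quantization of a weight tensor: half the fp16 memory, some error.
import torch

w = torch.randn(4096, 4096, dtype=torch.float16)      # a "layer" of weights
scale = w.abs().max() / 127.0                          # one scale for the whole tensor
w_int8 = torch.clamp((w / scale).round(), -127, 127).to(torch.int8)
w_back = w_int8.to(torch.float16) * scale              # dequantized approximation

print("fp16 MB:", w.numel() * 2 / 1e6)                 # ~33.6 MB
print("int8 MB:", w_int8.numel() * 1 / 1e6)            # ~16.8 MB
print("mean abs error:", (w - w_back).abs().mean().item())
```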
tithe|2 years ago
If you size the browser window right and page with the arrow keys (so the document doesn't scroll), you'll see (e.g., on pages 20-21) that the textures of the parrot's feathers are almost identical to the bark on the tree behind the panda bear, and the forest behind the red panda is very similar to the undersea environment.
Even if I'm misunderstanding something fundamental here about this technique, I still find this interesting!
astrange|2 years ago
2.1 didn't have adoption because people didn't want to deal with the open replacement for CLIP. Or possibly because everyone confused 2.0 and 2.1.
swyx|2 years ago
How exactly did the community deal with it? Interested to learn how to unlearn safety.
samstave|2 years ago
>>>Its in sizes from 800m to 8b parameters now, will be all sizes for all sorts of edge to giant GPU deployment.
--
Can you fragment responses such that, if an edge device (a mobile app) is prompted for [thing], it can pass tokens upstream on the prompt (effectively torrenting responses)? You could then push actual GPU edge devices into certain areas, like dense cities that are expected to have a ton of GPU cycle consumption around the edge.
So you'd have tiered processing (speed is handled locally, quality level 1 can take some edge GPU, and corporate stuff can be handled in the cloud)...
----
Can you fragment and torrent a response?
If so, how is that request torn up and routed to appropriate resources?
BOFH me if this is a stupid question (but it's valid for how quickly we're evolving toward AI being intrinsic to our society).
swyx|2 years ago
Can someone explain how negation is currently done in Stable Diffusion? And why can't we do it in text LLMs?
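For context on the first half of that question: Stable Diffusion pipelines usually handle negation with a negative prompt via classifier-free guidance, not by the text encoder understanding the word "not". A simplified sketch of the guidance step (the real diffusers pipeline batches the two UNet passes; this function is only illustrative):

```python
# Classifier-free guidance with a negative prompt: push the denoising direction
# away from the negative embedding and toward the positive one.
def guided_noise(unet, latents, t, cond_emb, negative_emb, guidance_scale=7.5):
    noise_neg = unet(latents, t, encoder_hidden_states=negative_emb).sample
    noise_pos = unet(latents, t, encoder_hidden_states=cond_emb).sample
    return noise_neg + guidance_scale * (noise_pos - noise_neg)
```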
sandworm101|2 years ago
Soon the GPU and its associated memory will be on different cards, as once happened with CPUs. The day of the GPU with RAM slots is fast approaching. We will soon plug terabytes of RAM into our 4090s, then plug a half-dozen 4090s into a Raspberry Pi to create a Cronenberg rendering monster. Can it generate movies faster than Pixar can write them? Sure. Can it play Factorio? Heck no.
jsheard|2 years ago
If you don't care about bandwidth you can already have a GPU access terabytes of memory across the PCIe bus, but it's too slow to be useful for basically anything. Best case you're getting 64GB/sec over PCIe 5.0 x16, when VRAM is reaching 3.3TB/sec on the highest-end hardware and even mid-range consumer cards are doing >500GB/sec (rough math sketched below).
Things are headed the other way, if anything: Apple and Intel are integrating RAM onto the CPU package for better performance than is possible with socketed RAM.
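A quick back-of-the-envelope check on that gap, using the published PCIe 5.0 link parameters and the VRAM figures quoted above:

```python
# PCIe 5.0 x16 vs. VRAM bandwidth, order-of-magnitude only.
pcie5_gtps = 32.0                                # GT/s per lane
lanes = 16
encoding = 128 / 130                             # PCIe 5.0 line-coding overhead
pcie_gbps = pcie5_gtps * lanes * encoding / 8    # GB/s per direction
print(f"PCIe 5.0 x16: ~{pcie_gbps:.0f} GB/s per direction")   # ~63 GB/s

for name, vram_gbps in (("mid-range consumer card", 500), ("high-end HBM part", 3300)):
    print(f"{name}: {vram_gbps} GB/s, ~{vram_gbps / pcie_gbps:.0f}x the PCIe link")
```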