item 44790754

tetraodonpuffer|6 months ago

I think the fact that, as far as I understand, it takes 40GB of VRAM to run, is probably dampening some of the enthusiasm.

As an aside, I am not sure why the technology to split a model across multiple cards is quite mature for LLMs, while for image models, despite also shipping as GGUFs, this has not been the case. Maybe as image models become bigger there will be more of a push to implement it.
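One reason layer-wise (pipeline) splitting is natural for LLMs is that a transformer is a linear chain of blocks: you can assign contiguous runs of layers to each GPU and only pass activations at the boundaries. A minimal sketch of that packing idea, with purely illustrative layer sizes and GPU budgets (the function name and numbers are assumptions, not any library's API):

```python
# Sketch: greedily pack consecutive layers onto GPUs, preserving order,
# so activations only cross a device boundary once per split point.

def assign_layers(layer_sizes_gb, gpu_budgets_gb):
    """Return, for each layer, the index of the GPU it is placed on."""
    assignment, gpu, used = [], 0, 0.0
    for size in layer_sizes_gb:
        if used + size > gpu_budgets_gb[gpu]:
            gpu += 1          # current card is full; move to the next one
            used = 0.0
            if gpu >= len(gpu_budgets_gb):
                raise MemoryError("model does not fit on the given GPUs")
        assignment.append(gpu)
        used += size
    return assignment

# 40 layers of 1 GB each across two 24 GB cards:
plan = assign_layers([1.0] * 40, [24.0, 24.0])
print(plan)  # first 24 layers land on GPU 0, the remaining 16 on GPU 1
```

Diffusion backbones (UNets, DiTs) are also mostly sequential, so the same partitioning would apply in principle; the gap is tooling support, not model structure.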

reissbaker|6 months ago

40GB is small IMO: you can run it on a mid-tier Macbook Pro... or the smallest M3 Ultra Mac Studio! You don't need Nvidia if you're doing at-home inference; Nvidia only becomes economical at very high throughput, i.e. dedicated inference companies. Apple Silicon is much more cost-effective for single-user inference of small-to-medium-sized models. The M3 Ultra is roughly on par with a 4090 in terms of memory bandwidth, so it won't be much slower, although it won't match a 5090.

Also, for a 20B model you only really need 20GB of VRAM: FP8 is near-identical to FP16, and it's only below FP8 that you start to see dramatic drop-offs in quality. So literally any Mac Studio available for purchase will do, and even a fairly low-end Macbook Pro would work. A 5090 should be able to handle it with room to spare as well.
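The back-of-the-envelope arithmetic here is just parameter count times bytes per parameter (ignoring activation and KV-cache overhead, which adds a few GB on top). A quick sketch:

```python
# Weights-only VRAM estimate: params (in billions) * bytes per parameter.
# 1e9 params * N bytes = N GB, so the billions and GB cancel neatly.
# This deliberately ignores activation/KV-cache overhead.

def weights_gb(params_billion: float, bytes_per_param: float) -> float:
    return params_billion * bytes_per_param

for name, bpp in [("FP16", 2.0), ("FP8", 1.0), ("4-bit", 0.5)]:
    print(f"20B @ {name}: ~{weights_gb(20, bpp):.0f} GB")
# 20B @ FP16: ~40 GB
# 20B @ FP8:  ~20 GB
# 20B @ 4-bit: ~10 GB
```

This is where the thread's 40GB (FP16) and 20GB (FP8) figures come from.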

dur-randir|6 months ago

Memory bandwidth is only relevant when comparing LLM performance. For image generation, the limiting factor is compute, and Apple's hardware is weak there.

BoredPositron|6 months ago

If you want to wait 20 minutes for one image, you can certainly run it on a MacBook Pro.

RossBencina|6 months ago

Does M3 Ultra or later have hardware FP8 support on the CPU cores?

slickytail|6 months ago

All of this only really applies to LLMs though. LLMs are memory bound (due to higher param counts, KV caching, and causal attention) whereas diffusion models are compute bound (because of full self attention that can't be cached). So even if the memory bandwidth of an M3 ultra is close to an Nvidia card, the generation will be much faster on a dedicated GPU.
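The memory-bound vs. compute-bound distinction can be sketched with a roofline-style estimate: a step takes at least max(FLOPs / peak compute, bytes moved / bandwidth). The hardware numbers and token counts below are illustrative assumptions, not measurements:

```python
# Roofline-style sketch: LLM decode streams all weights once per token
# (memory-bound), while a diffusion step reuses the same weights across
# thousands of spatial tokens (compute-bound). All numbers illustrative.

def step_time_ms(flops, bytes_moved, peak_tflops, bw_gbs):
    compute_ms = flops / (peak_tflops * 1e12) * 1e3
    memory_ms = bytes_moved / (bw_gbs * 1e9) * 1e3
    bound = "compute" if compute_ms > memory_ms else "memory"
    return max(compute_ms, memory_ms), bound

# One decoded token, 20B model at FP8: ~2 FLOPs/param, 20 GB of weights read.
llm = step_time_ms(flops=2 * 20e9, bytes_moved=20e9,
                   peak_tflops=100, bw_gbs=800)

# One diffusion step over ~4096 latent tokens: same weight traffic, but each
# weight is reused across every token, so FLOPs dwarf bytes moved.
diff = step_time_ms(flops=2 * 20e9 * 4096, bytes_moved=20e9,
                    peak_tflops=100, bw_gbs=800)

print(llm)   # dominated by the memory term
print(diff)  # dominated by the compute term
```

This is why matching an Nvidia card's memory bandwidth (as the M3 Ultra roughly does) helps LLM decode speed but doesn't close the gap for diffusion, where raw TFLOPS dominate.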

cma|6 months ago

If it's 40GB, you can lightly quantize it and fit it on a 5090.

AuryGlenz|6 months ago

Which very few people have, comparatively.

Training it will also be out of reach for most. I’m sure I’ll be able to handle it on my own 5090 at some point but it’ll be slow going.

TacticalCoder|6 months ago

> I think the fact that, as far as I understand, it takes 40GB of VRAM to run, is probably dampening some of the enthusiasm.

40 GB of VRAM? So two GPUs with 24 GB each? That's pretty reasonable compared to the kind of machine needed to run the latest Qwen coder models (which btw are close to SOTA: they even beat proprietary models on several benchmarks).

cellis|6 months ago

A 3090 + 2x Titan XP? Technically I have 48GB, but I don't think you can "split it" over multiple cards. At least with Flux, it would OOM the Titans and allocate the full 3090.

AuryGlenz|6 months ago

You can’t split image models over 2 GPUs like you can LLMs.