mechagodzilla | 2 months ago

You can keep scaling down! I spent $2k on an old dual-socket Xeon workstation with 768GB of RAM - I can run DeepSeek-R1 at ~1-2 tokens/sec.
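
For reference, something like the llama-cpp-python setup below is one way to do a CPU-only run; the GGUF filename, quant level, and thread count are placeholders rather than my exact setup:

    # CPU-only inference with llama-cpp-python (pip install llama-cpp-python).
    # The model path and quant below are placeholders, not the exact files I use.
    from llama_cpp import Llama

    llm = Llama(
        model_path="deepseek-r1-q4_k_m.gguf",  # placeholder GGUF filename
        n_ctx=8192,       # context window
        n_threads=64,     # roughly match your physical core count
        n_gpu_layers=0,   # keep everything on the CPU
    )

    out = llm("Summarize the tradeoffs of CPU-only LLM inference.", max_tokens=256)
    print(out["choices"][0]["text"])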

Weryj | 2 months ago

Just keep going! 2TB of swap disk for 0.0000001 t/sec

kergonath | 2 months ago

Hang on, starting benchmarks on my Raspberry Pi.

jacquesm | 2 months ago

I did the same, then put in 14 3090s. It's a little bit power hungry, but fairly impressive performance-wise. The hardest parts are power distribution and riser cards, but I found good solutions for both.

r0b05 | 2 months ago

I think 14 3090s are more than a little power hungry!

tucnak | 2 months ago

You get occasional accounts of 3090 home-superscalers where they put up eight, ten, fourteen cards. I normally attribute this to obsessive-compulsive behaviour. What kind of motherboard did you end up using, and what's the bi-directional bandwidth you're seeing? Something tells me you're not using EPYC 9005s with their 128 PCIe 5.0 lanes per socket or something... Also: I find it hard to believe the "performance" claims when your rig is pulling ~3 kW from the wall (assuming undervolting to 200W per card?). The electricity costs alone would surely make this intractable; it's like running six washing machines all at once.

ternus | 2 months ago

And if you get bored of that, you can flip the RAM for more than you spent on the whole system!

a012 | 2 months ago

And heat the whole house in parallel

rpastuszak | 2 months ago

Nice! What do you use it for?

mechagodzilla | 2 months ago

1-2 tokens/sec is perfectly fine for 'asynchronous' queries, and the open-weight models are pretty close to frontier quality (maybe a few months behind?). I frequently use it for a variety of research topics, feasibility studies for wacky ideas, and some prototype-ish coding tasks. I usually give it a prompt and come back half an hour later to see the results (although the thinking traces are sufficiently entertaining that sometimes it's fun to just read them as they come out). Being able to see the full thinking traces (and pause and alter/correct them if needed) is one of my favorite aspects of running these models locally. The thinking traces are frequently just as useful as, or more useful than, the final outputs.
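
If anyone's curious, the "pause and alter" workflow is basically prefix continuation: stop the stream, edit the partial trace, then resume generation from the edited text. A rough llama-cpp-python sketch is below; the model path, prompt, and stop condition are all placeholders, and you'd want to use DeepSeek-R1's actual chat template rather than the bare prompt shown here:

    # Pause-edit-resume sketch with llama-cpp-python; paths and prompt are placeholders.
    from llama_cpp import Llama

    llm = Llama(model_path="deepseek-r1-q4_k_m.gguf", n_ctx=8192, n_threads=64)

    prompt = "Is 1-2 tokens/sec usable for research queries? Think step by step.\n"
    trace = ""
    for chunk in llm(prompt, max_tokens=2048, stream=True):
        piece = chunk["choices"][0]["text"]
        print(piece, end="", flush=True)
        trace += piece
        if len(trace) > 4000:  # arbitrary point where I decide to step in
            break

    # Edit the partial thinking trace by hand (or programmatically) ...
    trace = trace.replace("I should give up", "I should reconsider the constraints")
    # ... then resume generation from the corrected prefix.
    resumed = llm(prompt + trace, max_tokens=2048)
    print(resumed["choices"][0]["text"])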