top | item 44627617

(no title)

QRY | 7 months ago

Have you considered the Framework Desktop setup they mentioned in their announcement blog post[0]? Just marketing fluff, or is there any merit to it?

> The top-end Ryzen AI Max+ 395 configuration with 128GB of memory starts at just $1999 USD. This is excellent for gaming, but it is a truly wild value proposition for AI workloads. Local AI inference has been heavily restricted to date by the limited memory capacity and high prices of consumer and workstation graphics cards. With Framework Desktop, you can run giant, capable models like Llama 3.3 70B Q6 at real-time conversational speed right on your desk. With USB4 and 5Gbit Ethernet networking, you can connect multiple systems or Mainboards to run even larger models like the full DeepSeek R1 671B.

I'm futsing around with setups, but adding up the specs would give 384GB of VRAM and 512GB total memory, at a cost of about $10,000-$12,000. This is all highly dubious napkin math, and I hope to see more experimentation in this space.

There's of course the moving target of cloud costs and performance, so analysing break-even time is even more precarious. So if this sort of setup would work, its cost-effectiveness is a mystery to me.

[0] https://frame.work/be/en/blog/introducing-the-framework-desk...

discuss

lhl|7 months ago

Strix Halo does not run a 70B Q6 dense model at real-time conversational speed - it has a real-world MBW of about 210 GB/s. A 40GB Q4 will clock just over 5 tok/s. A Q6 would be slower.

It will run some big MoEs at a decent speed (eg, Llama 4 Scout 109B-A17B Q4 at almost 20 tok/s). The other issue is its prefill - only about 200 tok/s due to having only very under-optimized RDNA3 GEMMs. From my testing, you usually have to trade off pp for tg.

If you are willing to spend $10K for hardware, I'd say you are much better off w/ EPYC and 12-24 channels of DDR5, and a couple fast GPUS for shared experts and TFLOPS. But, unless you are doing all-night batch processing, that $10K is probably better spent on paying per token or even renting GPUs (especially when you take into account power).

Of course, there may be other reasons you'd want to inference locally (privacy, etc).

moffkalast|7 months ago

Yeah it's only really viable for chat use cases, coding is the most demanding in terms of generation speed, to keep the workflow usable it needs to spit out corrections in seconds, not minutes.

I use local LLMs as much as possible myself, but coding is the only use case where I still entirely defer to Claude, GPT, etc. because you need both max speed and bleeding edge model intelligence for anything close to acceptable results. When Qwen-3-Coder lands + having it on runpod might be a low end viable alternative, but likely still a major waste of time when you actually need to get something done properly.

cheeze|7 months ago

I love Framework but it's still not enough IMO. My time is the most valuable thing, and a subscription to $paid_llm_of_choice is _cheap_ relative to my time spent working.

In my experience, something Llama 3.3 works really well for smaller tasks. For "I'm lazy and want to provide minimal prompting for you to build a tool similar to what is in this software package already", paid LLMs are king.

If anything, I think the best approach for free LLMs would be to run using rented GPU capacity. I feel bad knowing that I have a 4070ti super that sits idle for 95% of the time. I'd rather share an a1000 with bunch of folks and have that run at close to max utilization.

generic92034|7 months ago

> and a subscription to $paid_llm_of_choice is _cheap_ relative to my time spent working.

In the mid to long term the question is, is the subscription covering the costs of the LLM provider. Current costs might not be stable for long.

jakebennet89|7 months ago

[deleted]

smcleod|7 months ago

The framework desktop isn't really that compelling for work with LLMs, it's memory bandwidth is very low compared to GPUs and Apple Silicon Max/Ultra chips - you'd really notice how slow LLMs are on it to the point of frustration. Even a 2023 Macbook Pro with a M2 Max chip has twice the usable bandwidth.

komali2|7 months ago

They demo'd it live at computex and it was slooooow. Like two characters a second slow. Iirc he had 4 machines clustered.

oblio|7 months ago

One second, don't LLMs generally run in VRAM? If you put them in regular RAM, don't they have to go through the CPU which kills performance?

pxeger1|7 months ago

The mentioned CPU uses unified memory for its built in GPU / NPU. I.e. some portion of what could ordinarily be system RAM is given to the GPU instead of the CPU

zackify|7 months ago

The memory bandwidth is crap and you’ll never run anything close to Claude on that unfortunately. They should have shipped something 8x faster at least 2 tb/s bandwidth