Benchmark Framework Desktop Mainboard and 4-node cluster

[+] mhitza|7 months ago|reply

I've ran a comparison benchmark for the smaller models https://gist.github.com/mhitza/f5a8eeb298feb239de10f9f60f841...

Comparing it against the RTX 4000 SFF Ada (20GB) which is around $1.2k (if you believe the original price on the nvidia website https://marketplace.nvidia.com/en-us/enterprise/laptops-work...). Which I have access to on a Hetzner GEX44.

I'm going to ballpark it between 2.5-3x faster than the desktop. Except for the tg128 test, where the difference is "minimal" (but I didn't do the math).

[+] yencabulator|7 months ago|reply

The whole point of these integrated memory designs is to go beyond that 20 GB VRAM.

[+] Tsiklon|7 months ago|reply

I see Wendell of Level1Techs combines the two in his video on this system.

Theoretically you can have the best of both worlds if you don’t mind running an Occulink E-GPU enclosure

https://youtu.be/L-xgMQ-7lW0

[+] reissbaker|7 months ago|reply

Thanks for the excellent writeup. I'm pleasantly surprised that ROCm worked as well as it did — for the price these aren't bad for LLM workloads and some moderate gaming. (Apple is probably still the king of affordable at-home inference, but for games... Amazing these days but Linux is so much better.)

[+] mulmen|7 months ago|reply

I switched to Fedora Sway as my daily driver nearly two years ago. A Windows title wasn’t working on my brand new PC. I switched to Steam+Proton+Fedora and it worked immediately. Valve now offers a more stable and complete Windows API through Proton than Microsoft does through Windows itself.

[+] jeffbee|7 months ago|reply

I had been hoping that these would be a bit faster than the 9950X because of the different memory architecture, but it appears that due to the lower power design point the AI Max+ 395 loses across the board, by large margins. So I guess these really are niche products for ML users only, and people with generic workloads that want more than the 9950X offers are shopping for a Threadripper.

[+] dijit|7 months ago|reply

Sounds about right.

I’m struggling to justify the cost of a Threadripper (let alone pro!) for a AAA game studio though.

I wonder who can justify these machines. High frequency trading? data science? shouldn’t that be done on servers?

[+] sliken|7 months ago|reply

Across the board, by a large margin? Phoronix ran 200 benchmarks on the 9950x vs 395x max and found a difference of less than 5%. Not bad considering the average power use was 91 watts vs 154 watts.

If you need the memory bandwidth the strix halo looks good, if you are cache friendly and don't care about using almost double the power than the 9950x is a better deal.

[+] rtkwe|7 months ago|reply

It also seems like the tools aren't there to fully utilize them. Unless I misunderstood he was running off CPU only for all the test so there's still the iGPU and NPU performance that's not been utilized in these tests.

[+] adolph|7 months ago|reply

The Framework Desktop has at least two M.2 connectors for NVME. I wonder if an interconnect with higher performance than Ethernet or Thunderbolt could be established using one of the M.2 to connect to PCIe via Oculink?

[+] nrp|7 months ago|reply

There is also a PCIe x4 slot that you can use for other high throughput network options.

[+] _joel|7 months ago|reply

> usually resulting in one word repeating ad infinitum

I've had that using gemini (via windsurf). Doesn't seem to happen with other models. No idea if there's any correlation but it's an interesting failure mode.

[+] mattnewton|7 months ago|reply

This is usually a symptom of greedy sampling (always picking the most probable token) on smaller models. It's possible that configuration had different sampling defaults, ie. was not using top p or temperature. I'm not familiar with distributed-llama but from searching the git repo it looks like it at least takes a --temperature flag and probably has one for top p.

I'd recommend rerunning the benchmarks with the sampling methods explicitly configured the same in each tool. It's tempting to benchmark with all the nondeterminism turned off, but I think it's less useful since in practice for any model you're self hosting for real work you're going to probably want top-p sampling or something like it and you want to benchmark the implementation of that too.

I've never seen gemini do this though, that'd be kinda wild if they shipped something that samples that way. I wonder if windsurf was sending a different config over the api or if this was a different bug.

[+] justinclift|7 months ago|reply

I've seen that occasionally with one of the deepseek models when using the default Ollama context size of 4096, rather than whatever the model's preferred context size was.

After having that happen, I switched my stuff to check the model's preferred context size, then set the context size to match, before using any given model.

[+] mrbungie|7 months ago|reply

Yep, sometimes Gemini for some reason ends up in what I call "ergodic self-flagellation".

Here are some examples: https://www.reddit.com/r/GeminiAI/comments/1lxqbxa/i_am_actu...

[+] iamtheworstdev|7 months ago|reply

for those who are already in the field and doing these things - if I wanted to start running my own local LLM.. should I find an Nvidia 5080 GPU for my current desktop or is it worth trying one of these Framework AMD desktops?

[+] loudmax|7 months ago|reply

The short answer is that the best value is a used RTX 3090 (the long answer being, naturally, it depends). Most of the time, the bottleneck for running LLMs on consumer grade equipment is memory and memory bandwidth. A 3090 has 24GB of VRAM, while a 5080 only has 16GB of VRAM. For models that can fit inside 16GB of VRAM, the 5080 will certainly be faster than the 3090, but the 3090 can run models that simply won't fit on a 5080. You can offload part of the model onto the CPU and system RAM, but running a model on a desktop CPU is an enormous drag, even when only partially offloaded.

Obviously an RTX 5090 with 32GB of VRAM is even better, but they cost around $2000, if you can find one.

What's interesting about this Strix Halo system is that it has 128GB of RAM that is accessible (or mostly accessible) to the CPU/GPU/APU. This means that you can run much larger models on this system than you possibly could on a 3090, or even a 5090. The performance tests tend to show that the Strix Halo's memory bandwidth is a significant bottleneck though. This system might be the most affordable way of running 100GB+ models, but it won't be fast.

[+] wmf|7 months ago|reply

If you think the future is small models (27B) get Nvidia; if you think larger models (70-120B) are worth it then you need AMD or Apple.

[+] jauntywundrkind|7 months ago|reply

> For networking, I expected more out of the Thunderbolt / USB4 ports, but could only get 10 Gbps.

I really wish we saw more testing of USB subsystems! With PCIe being so limited, there's such allure to having two USB4 ports! But will they work?

IIRC we saw similar very low bandwidth on Apple's ARM chips too. This was during M1 or so; dunno if things got better with that chip or future ones! Presumably so or I feel like we'd be hearing about it, but also, these things can just go so hidden!

It was really cool back in Ryzen 1 era seeing their CPU get some USB on the cpu itself, not have to go through the IO/peripheral Hub (southbridge?), with its limited connection to the CPU. There's a great up breakout chart here, showing both the 1800x and the various chipsets available: relishable data. https://www.techpowerup.com/cpu-specs/ryzen-7-1800x.c1879

I feel like there's been some recent improvements to USB4/thunderbolt in the kernel, to really insure all lanes get used. But I'm struggling to find a reference/link. What kernel was this tested against? If nothing else, it's be great to poke around at debugfs, to make sure it's getting all the lanes configured. https://www.phoronix.com/news/Linux-6.13-USB-Changes

[+] Havoc|7 months ago|reply

Jeff - check out the distributed-llama project...you should be able to distribute over entire cluster

[+] geerlingguy|7 months ago|reply

I've been testing Exo (seems dead), llama.cpp RPC (has a lot of performance limitations) and distributed-llama (faster but has some Vulkan quirks and only works with a few models).

See my AI cluster automation setup here: https://github.com/geerlingguy/beowulf-ai-cluster

I was building that through the course of making this video, because it's insane how much manual labor people put into building home AI clusters :D

[+] yjftsjthsd-h|7 months ago|reply

https://github.com/b4rtaz/distributed-llama ?

[+] burnte|7 months ago|reply

He mentioned that in the video.

[+] sliken|7 months ago|reply

Apparently the frameworks desktop's 5g bit network isn't fast enough to scale well with LLM inference workloads, even for a modest GPU. Anyone know what kind of network is required to scale well for a single modest GPU?

[+] geerlingguy|7 months ago|reply

In the case of llama.cpp's RPC mode, the network isn't the limiting factor for inference, but for distributing layers to nodes.

I was monitoring the network while running various models, and for all models, the first step was to copy over layers (a few gigabytes to 100 or so GB for the huge models), and that would max out the 5 Gbps connection.

But then while warming up and processing, there were only 5-10 Mbps of traffic, so you could do it over a string and tin cans, almost.

But that's a limitation of the current RPC architecture, it can't really parallelize processing, so as I noted in the post and in my video, it kinda uses resources round-robin style, and you can only get worse performance across the entire cluster than on a single node for any model you can fit on the single node.

[+] jauntywundrkind|7 months ago|reply

Good news: USB4 mandates a direct host-to-host connectivity! Something it brought in from Thunderbolt. Hypothetically that should be 40Gbit connections, readily available.

There's some folks who use this for clustering. Here's a reddit around Mac systems. The top link is to a really not great hub-spoke model usage (not everyone has BGP skills alas). I've linked to that, 19m35s in. https://www.reddit.com/r/MacStudio/comments/1mc1z0s/anyone_c... https://youtu.be/Ju0ndy2kwlw?t=19m35s

I do hope that CXL 3.1 with its host to host capability makes gluess scale out easier. It's hyped as being for accelerators and attached memory, but having a much lower overhead RDMA capable fabric in every PCIe+CXL port is very very alluring. Can't come soon enough! Servers at first and maybe I'm hopelessly naive here but I do sort of expect it to show up on consumer too.

[+] rtkwe|7 months ago|reply

No network interconnect is going to scale well until you get into the expensive enterprise realm where infiniband and other direct connect copper/fiber reigns. The issue is less raw bandwidth but latency. Network is inherently 100x+ slower than memory access so when you start sharing a memory intensive workload like an LLM across a normal network it's going to crater your performance unless the work can be somewhat chunked to keep communication between nodes on the network to a minimum.

[+] bawana|7 months ago|reply

Mem bandwidth sucks compared to Mac Studio ultra 3. And you cant add gpus easily although as an apu it is impressive and way better than nvidias gold box. Wendell said it better. Im waiting for the Mac Studio ultra 5

[+] jauntywundrkind|7 months ago|reply

Yes. And this is so crucial! It's still a huge leap forward for x86. Quad-channel (4x) DDR5-8000 is both double the (client/non-server) lane count, and at a blisteringly high clock rate. That's very impressive.

Upcoming Zen6 Epyc was just confirmed to go to 12-> 16 channel. That'll be very good to see. The Strix Halo successor Medusa Halo is supposed to be 6x channel. (Most of these rumors/leaks via Moore's Law is Dead fwiw). It's absolutely needed to scale to more cores. But still seems so short of what AI demands.

I really can't congratulate Apple enough for being deadly serious about memory bandwidth. What is just gobsmacking to me is that no one else has responded, half a decade latter. Put the ram on chip! DDR, not crazy expensive HBM. The practice of building super chips out of 4x chips, getting scalability that way, feels so obvious too, is so commendable!

Different end of the spectrum, but Intels tablet-size Lakefield had Package-on-Package (PoP) ram, and pretty fast for its day (4266MHz). But it didn't scale up the width, like Apple has.

Its hard to see x86 seem so stuck, be so unable to make what feels like such a necessary push to

[+] hinkley|7 months ago|reply

Jeff! Someone needs to make Framework MBs work in a blade arrangement, and you seem to be the likely person to get it done.

[+] chickensong|7 months ago|reply

[+] syntaxing|7 months ago|reply

Kinda bummed, I get why he used Ollama but I feel like using llama cpp directly would provide better and more consistent results

[+] RossBencina|7 months ago|reply

I heard that ik_llama.cpp performs better for CPU use: https://github.com/ikawrakow/ik_llama.cpp/

[+] mkl|7 months ago|reply

As the article describes, most of this was done with llama.cpp, not Ollama.

[+] OrangeMusic|6 months ago|reply

Can you imagine a Beowulf cluster of these?

[+] xemdetia|7 months ago|reply

I was about to be annoyed until you said you got preprod units. I guess I'll have to build on this when my desktop shows up.

[+] lifeinthevoid|7 months ago|reply

Setup looks very sexy.

[+] nektro|7 months ago|reply

no compilation tests?

[+] geerlingguy|7 months ago|reply

Those are in my SBC-reviews repo: https://github.com/geerlingguy/sbc-reviews/issues/80

[+] oblio|7 months ago|reply

Those numbers are better than I was expecting.

[+] jvanderbot|7 months ago|reply

So, TL;DR?

I saw mixed results but comments suggest very good performance relative to other at-home setups. Can someone summarize?

88 comments