For those of you wondering whether this fits your use case versus the RTX 5090, the short answer is this:
The desktop RTX 5090 has 1792 GB/s of memory bandwidth, partly due to its 512-bit bus width, compared to the DGX Spark with a 256-bit bus and 273 GB/s of memory bandwidth.
The RTX 5090 has 32 GB of VRAM vs the 128 GB of “VRAM” in the DGX Spark, which is really unified memory.
Also, the RTX 5090 has 21760 CUDA cores vs 6144 in the DGX Spark (about 3.5x as many). And with the much higher bandwidth in the 5090 you have a better shot at keeping them fed. So for embarrassingly parallel workloads the 5090 crushes the Spark.
So if you need to fit big models into VRAM and don’t care about speed too much because you are, for example, building something on your desktop that’ll run on data center hardware in production, the DGX Spark is your answer.
If you need speed and 32G of VRAM is plenty, and you don’t care about modeling network interconnections in production, then the RTX 5090 is what you want.
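A rough way to put the numbers above side by side is a back-of-the-envelope sketch in Python. It uses only the figures quoted in this comment, plus one loud assumption: that generating each token requires reading all model weights from memory once, so memory bandwidth sets a ceiling on tokens per second. The 16 GB model size is hypothetical, picked only because it fits on both machines. Real throughput also depends on KV cache traffic, compute, and software, so treat this as an upper bound, not a benchmark.

```python
# Figures quoted in the comment above.
RTX_5090_BANDWIDTH_GB_S = 1792
DGX_SPARK_BANDWIDTH_GB_S = 273
RTX_5090_CUDA_CORES = 21760
DGX_SPARK_CUDA_CORES = 6144

# Hypothetical model size (an assumption, not from the thread),
# chosen only so that it fits in 32 GB on both machines.
ASSUMED_WEIGHTS_GB = 16


def decode_ceiling_tokens_per_s(bandwidth_gb_s, weights_gb):
    """Upper bound on tokens/s if every token reads all weights once."""
    return bandwidth_gb_s / weights_gb


bandwidth_ratio = RTX_5090_BANDWIDTH_GB_S / DGX_SPARK_BANDWIDTH_GB_S
core_ratio = RTX_5090_CUDA_CORES / DGX_SPARK_CUDA_CORES

if __name__ == "__main__":
    print(f"bandwidth ratio: {bandwidth_ratio:.2f}x")
    print(f"CUDA core ratio: {core_ratio:.2f}x")
    for name, bw in [("RTX 5090", RTX_5090_BANDWIDTH_GB_S),
                     ("DGX Spark", DGX_SPARK_BANDWIDTH_GB_S)]:
        ceiling = decode_ceiling_tokens_per_s(bw, ASSUMED_WEIGHTS_GB)
        print(f"{name}: at most {ceiling:.1f} tokens/s for a {ASSUMED_WEIGHTS_GB} GB model")
```

The bandwidth gap is larger than the core-count gap, which is why the 5090 has a better shot at keeping its cores fed.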
> building something on your desktop that’ll run on data center hardware in production, the DGX Spark is your answer
It isn't, because it's a different architecture from the datacenter hardware. They're both called "Blackwell", but that's a lie[1] and you still need a "real" datacenter Blackwell card for development work. (For example, you can't configure/tune vLLM on Spark, then move it to a B200 and expect it to work.)
It's also worth noting that the 128GB of "VRAM" in the GB10 is even less straightforward than just being aware that the memory is shared with the CPU cores. There are a lot of details in memory performance that differ across both the different core types and the two core clusters:
I've got the Dell version of the DGX Spark as well, and was very impressed with the build quality overall. Like Jeff Geerling noted, the fans are super quiet. And since I don't keep it powered on continuously and mainly connect to it remotely, the LED is a nice quick check for power.
You can get two Strix Halo PCs with similar specs for that $4000 price.
I just hope that prompt processing speeds will continue to improve, because Strix Halo is still quite slow in that regard.
Then there is the networking. While Strix Halo systems come with two USB4 40Gbit/s ports, it's difficult to
a) connect more than 3 machines with two ports each
b) get more than 23 Gbit/s or so per connection, if you're lucky. Latency will also be in the 0.2 ms range, which leaves room for improvement.
Something like Apple's RDMA via Thunderbolt would be great to have on Strix Halo…
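To make point a) concrete, here is a small Python sketch. It assumes the limit being described is direct connectivity: in a full mesh, every machine needs one port per peer, so two ports per machine cap a fully connected cluster at three machines. Past that you fall back to something like a ring, where traffic has to hop through intermediate machines. The function names are made up for illustration.

```python
def max_full_mesh_nodes(ports_per_node):
    """Largest cluster in which every node links directly to every other node."""
    return ports_per_node + 1


def full_mesh_links(nodes):
    """Number of point-to-point cables a full mesh of `nodes` machines needs."""
    return nodes * (nodes - 1) // 2


def ring_worst_case_hops(nodes):
    """Worst-case hop count between two nodes on a bidirectional ring."""
    return nodes // 2


if __name__ == "__main__":
    ports = 2  # two USB4 ports per Strix Halo machine, per the comment above
    print("full mesh tops out at", max_full_mesh_nodes(ports), "machines")
    print("cables needed for that mesh:", full_mesh_links(max_full_mesh_nodes(ports)))
    for n in (4, 6, 8):
        print(n, "machines in a ring: up to", ring_worst_case_hops(n), "hops")
```

Each extra hop adds latency on top of the roughly 0.2 ms per connection mentioned above, which is one reason larger clusters on two-port machines get awkward.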
As you allude to, prompt processing speed is a killer advantage of the Spark, one that even two Strix Halo boxes would not match.
Prompt processing is literally 3x to 4x higher on GPT-OSS-120B once you are a little bit into your context window, and it is similarly much faster for image generation or any other AI task.
Plus the Nvidia ecosystem, as others have mentioned.
If all you care about is token generation with a tiny context window, then they are very close, but that’s basically the only time. I studied this problem extensively before deciding what to buy, and I wish Strix Halo had been the better option.
The primary advantage of the DGX box is that it gives you access to the NVIDIA ecosystem. You can develop against it almost like a mini version of the big servers you're targeting.
It's not really intended to be a great value box for running LLMs at home. Jeff Geerling talks about this in the article.
NVFP4 (and, to a lesser extent, MXFP8) works in general. In terms of usable FLOPS, the DGX Spark and the GMKtec EVO-X2 both lose to the 5090, but with NCCL and OpenMPI set up, the DGX is still the nicest way to dev for our SBSA future. Working on that too; it's a harder problem.
I know it's just a quick test, but Llama 3.1 is getting a bit old. I would have liked to see a newer model that can fit, such as gpt-oss-120b (gpt-oss-120b-mxfp4.gguf), which is about 60 GB of weights (1).
Correct, most of r/LocalLLaMA has moved on to next-gen MoE models. DeepSeek introduced a few good optimizations that every new model seems to use now too. Llama 4 was generally seen as a fiasco, and Meta hasn't made a release since.
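For a sense of where a figure like "about 60 GB" comes from, here is a quick arithmetic sketch in Python. MXFP4 stores 4-bit elements in blocks of 32 that share one 8-bit scale, which works out to 4.25 bits per weight. The parameter count below is an assumption taken from the nominal "120b" in the model name, and the sketch pretends every tensor is stored in MXFP4, which is not necessarily true of the actual GGUF file. So expect the real file size to differ somewhat.

```python
# MXFP4 layout: 4-bit elements, one shared 8-bit scale per block of 32.
MXFP4_ELEMENT_BITS = 4
MXFP4_SCALE_BITS = 8
MXFP4_BLOCK_SIZE = 32

# Assumption: nominal parameter count read off the model name, not measured.
ASSUMED_PARAMS = 120e9


def mxfp4_bits_per_weight():
    """Effective bits per weight, including the amortized block scale."""
    return MXFP4_ELEMENT_BITS + MXFP4_SCALE_BITS / MXFP4_BLOCK_SIZE


def weights_size_gb(params, bits_per_weight):
    """Storage needed for `params` weights, in decimal gigabytes."""
    return params * bits_per_weight / 8 / 1e9


if __name__ == "__main__":
    bits = mxfp4_bits_per_weight()
    size = weights_size_gb(ASSUMED_PARAMS, bits)
    print(f"{bits} bits/weight -> roughly {size:.1f} GB of weights")
```

Either way, the result is in the same ballpark as the figure quoted above: far too big for a 32 GB card, but comfortable in 128 GB of unified memory.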
IMHO the DGX Spark at $4,000 is a bad deal, with only 273 GB/s of bandwidth and compute capacity between a 5070 and a 5070 Ti. And with PCIe 5.0 at 64 GB/s it's not such a big difference.
And the 2x 200 GBit/s QSFP... why would you stack a bunch of these? Does anybody actually use them in day-to-day work/research?
I think the selling point is the 128GB of unified system memory. With that you can run some interesting models. The 5090 maxes out at 32GB, and those cost $3,000 or more at the moment.
I have a slightly cheaper similar box, the NVIDIA Thor Dev Kit. The point is exactly to avoid deploying code to servers that cost half a million dollars each. It's quite capable of running or training smart LLMs like Qwen3-Next-80B-A3B-Instruct-NVFP4, so long as you don't tear your hair out first figuring out peculiarities and fighting with bleeding-edge nightly vLLM builds.
Absent disassembly and direct comparison between a DGX Spark and a Dell GB10, I don't think there's sufficient evidence to say what is meaningfully different between these devices (beyond the obvious power LED). Anything over 240W is beyond the USB-C EPR spec, and while Dell does have a questionably compliant USB-C 280W supply, you'd have to compare actual power consumption to see if the Dell supply is actually providing more power. I suspect any other minor differences in experience/performance are better explained by the increasing maturity of the DGX software stack than by anything unique to the Dell version; in particular, any comparisons to very early DGX Spark behavior need to keep in mind that the software and firmware have seen a number of updates.
Comparing notes with Wendell from Level1Techs, the ASUS and Dell GB10 boxes were both able to sustain better performance due to their better thermal management. That's a fairly significant improvement. The Spark's crusted gold facade seems more form over function.
The memory bandwidth limitation is baked into the GB10, and every vendor is going to be very similar there.
I'm really curious to see how things shift when the M5 Ultra with "tensor" matmul functionality in the GPU cores rolls out. This should be a multiple-fold speedup for that platform.
A nice little AI review, with a comparison of CPU, power draw, and networking. I would be interested in seeing a fine-tuning comparison too. I think pricing was missing as well.
There's an entire line of Linux-supported Jetson products available for your perusal, in addition to all of the GTX and RTX cards that have native ARM64 support.
Jeff, this is the second time you have been given a prosumer-level cluster pretty much built for local LLM inference, and on both occasions you have performed benchmarks without batching.
If you still have the hardware (this and the Mac cluster) can you PLEASE get some advice and run some actually useful benchmarks?
Batching on a single consumer GPU often results in 3-4x the throughput. We have literally no idea what that batching looks like on a $10k+ cluster without otherwise dropping the cash to find out.
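For anyone wondering why batching changes the picture so much, here is a toy roofline-style sketch in Python. The idea, stated as an assumption: during token generation the weights are read from memory once per step no matter how many sequences are in the batch, while the compute cost grows with batch size. Throughput therefore scales with batch size until compute becomes the bottleneck. The timings below are made up purely for illustration and are not measurements of any hardware in this thread.

```python
def decode_throughput(batch_size, weight_read_s, compute_per_seq_s):
    """Tokens/s for one decode step under a toy roofline model.

    A step takes as long as the slower of: reading the weights once
    (shared by the whole batch), or doing the compute for every sequence.
    """
    step_time_s = max(weight_read_s, batch_size * compute_per_seq_s)
    return batch_size / step_time_s


# Hypothetical timings (assumptions, not benchmarks).
WEIGHT_READ_S = 0.0625        # time to stream all weights once
COMPUTE_PER_SEQ_S = 0.015625  # compute time per sequence per step

if __name__ == "__main__":
    for batch in (1, 2, 4, 8):
        tps = decode_throughput(batch, WEIGHT_READ_S, COMPUTE_PER_SEQ_S)
        print(f"batch={batch}: {tps:.0f} tokens/s total")
```

With these invented numbers, throughput grows linearly up to a batch of 4 and then flattens. Where that knee sits on a real cluster is exactly what an unbatched benchmark cannot tell you.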
mmaunder|1 month ago
kouteiheika|1 month ago
[1] -- https://github.com/NVIDIA/dgx-spark-playbooks/issues/22
chao-|1 month ago
https://chipsandcheese.com/p/inside-nvidia-gb10s-memory-subs...
jasoneckert|1 month ago
But the nicest addition Dell made in my opinion is the retro 90's UNIX workstation-style wallpaper: https://jasoneckert.github.io/myblog/grace-blackwell/
ranger_danger|1 month ago
https://www.fsi-embedded.jp/contents/uploads/2018/11/DELLEMC...
mapontosevenths|1 month ago
Tepix|1 month ago
coder543|1 month ago
One discussion with benchmarks: https://www.reddit.com/r/LocalLLaMA/comments/1oonomc/comment...
Aurornis|1 month ago
benreesman|1 month ago
kristianp|1 month ago
(1) https://github.com/ggml-org/llama.cpp/discussions/15396
geerlingguy|1 month ago
eurekin|1 month ago
alecco|1 month ago
I liked the idea until the final specs came out.
BadBadJellyBean|1 month ago
cat_plus_plus|1 month ago
echion|1 month ago
Sounds interesting; can you suggest any good discussions of this (on the web)?
kachapopopow|1 month ago
cjbgkagh|1 month ago
cpgxiii|1 month ago
geerlingguy|1 month ago
npalli|1 month ago
https://www.dell.com/en-us/shop/desktop-computers/dell-pro-m...
graham33|1 month ago
colordrops|1 month ago
llm_nerd|1 month ago
cat_plus_plus|1 month ago
dagaci|1 month ago
geerlingguy|1 month ago
postalrat|1 month ago
bigyabai|1 month ago
barelysapient|1 month ago
geerlingguy|1 month ago
nightski|1 month ago
supermatt|1 month ago