Author here. I've updated the article based on your feedback. Thank you.
Key corrections:
Ollama GPU usage - I was wrong. It IS using GPU (verified 96% utilization). My "CPU-optimized backend" claim was incorrect.
FP16 vs BF16 - enum caught the critical gap: I trained with BF16, tested inference with FP16 (broken), but never tested BF16 inference. "GPU inference fundamentally broken" was overclaimed. Should be "FP16 has issues, BF16 untested (likely works)." A quick way to check is sketched at the end of this comment.
llama.cpp - veber-alex's official benchmark link proves it works. My issues were likely version-specific, not representative.
ARM64+CUDA maturity - bradfa was right about Jetson history. ARM64+CUDA is mature. The new combination is Blackwell+ARM64, not ARM64+CUDA itself.
The HN community caught my incomplete testing, overclaimed conclusions, and factual errors.
Ship early, iterate publicly, accept criticism gracefully.
Thanks especially to enum, veber-alex, bradfa, furyofantares, stuckinhell, jasonjmcghee, eadwu, and renaudr. The article is significantly better now.
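For anyone who wants to close that BF16 gap themselves, here's a minimal sketch of the kind of check I mean, assuming PyTorch + transformers; the model id is a placeholder, not necessarily what the article used:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "google/gemma-3-4b-it"  # placeholder model id
PROMPT = "Briefly explain what unified memory means on the DGX Spark."

tok = AutoTokenizer.from_pretrained(MODEL)
for dtype in (torch.float16, torch.bfloat16):
    # Load the same checkpoint once per dtype and compare the generations.
    model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=dtype).to("cuda")
    inputs = tok(PROMPT, return_tensors="pt").to("cuda")
    out = model.generate(**inputs, max_new_tokens=64, do_sample=False)
    print(f"--- {dtype} ---")
    print(tok.decode(out[0], skip_special_tokens=True))
    del model
    torch.cuda.empty_cache()
```

If FP16 really is the problem, the first run should produce visibly degraded text while the BF16 run stays coherent.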
Is there a reason why you used an LLM for the entire article, and moreover, even for this comment? Couldn't you have at least written this comment yourself?
Is that version correct?
Asking because (in Ollama terms) it's positively ancient, 0.12.6 being the most recent release (currently).
I'm guessing it _might_ make a difference, as the Ollama crowd do seem to be changing things, adding new features and optimisations (etc) quite often.
For example, that 0.12.6 version is where initial experimental support for Vulkan (i.e. Intel Xe GPUs) was added, and in my testing that worked. Not that Vulkan support would do anything in your case. ;)
Late to the party here, but you should definitely be using PyTorch 25.09 (or whatever is latest when you go to check) rather than 24.10. That's a year-old PyTorch on new hardware; I suspect a lot of these bugs have been fixed.
One of my colleagues wrote a first impressions blog post last week: https://www.anaconda.com/blog/python-nvidia-dgx-spark-first-...
It's from our company's perspective, but it's a solid overview of the product and its intended capabilities, from the POV of an AI developer or data scientist.
> There you’ll see the 10 Cortex-X925 (“performance”) cores listed with a peak clock rate of 4 GHz, along with the 10 Cortex-A725 (“efficiency”) cores listed with a peak clock rate of 2.8 GHz
> If you start Python and ask it how many CPU cores you have, it will count both kinds of cores and report 20
> Note that because of the speed difference between the cores, you will want to ensure there is some form of dynamic scheduling in your application that can load balance between the different core types.
Sounds like a new type of hell where I now not only need to manage the threads themselves, but also take into account what type of core they run on, while Python straight up reports them as the same.
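If you want to see this for yourself, here's a minimal sketch, assuming Linux exposes the usual cpufreq sysfs entries on this board:

```python
import os
from pathlib import Path

print("os.cpu_count():", os.cpu_count())  # counts both core types, so it reports 20

# Group cores by their advertised max clock to tell the X925 and A725 clusters apart.
clusters = {}
for cpu in sorted(Path("/sys/devices/system/cpu").glob("cpu[0-9]*")):
    freq_file = cpu / "cpufreq" / "cpuinfo_max_freq"
    if freq_file.exists():
        clusters.setdefault(int(freq_file.read_text()), []).append(cpu.name)

for khz, cpus in sorted(clusters.items(), reverse=True):
    print(f"{khz / 1e6:.1f} GHz max: {len(cpus)} cores")
```

Anything that sizes its own thread pool only sees the flat count of 20, which is exactly the scheduling problem described above.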
> The CPU memory is the same as the GPU memory and is much larger than any other discrete GPU available in a desktop. That means much larger datasets and bigger models can be run locally than would be possible otherwise.
Isn't this the same architecture that Apple's M-series implements, from a memory perspective?
I absolutely love it. I’ve been up for days playing with it. But there are some bleeding-edge issues. I tried to write a balanced article. I would highly recommend it for people who love to get their hands dirty. It blows away any consumer GPU.
I have H100s to myself, and access to more GPUs than I know what to do with in national clusters.
The Spark is much more fun. And I’m more productive. With two of them, you can debug shallow NCCL/MPI problems before hitting a real cluster. I sincerely love Slurm, but there’s nothing like a personal computer.
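As a concrete example of the kind of shallow problem you can shake out locally, here is a minimal two-node NCCL smoke test sketch, assuming you launch it with torchrun on both Sparks; the hostnames and ports are placeholders:

```python
# nccl_check.py - run on both machines, e.g.:
#   torchrun --nnodes=2 --nproc_per_node=1 --node_rank=<0 or 1> \
#            --master_addr=<first-spark-hostname> --master_port=29500 nccl_check.py
import os
import torch
import torch.distributed as dist

def main():
    dist.init_process_group(backend="nccl")  # torchrun provides the rank/world-size env vars
    torch.cuda.set_device(int(os.environ.get("LOCAL_RANK", 0)))
    rank, world = dist.get_rank(), dist.get_world_size()

    # Each rank contributes (rank + 1); the all-reduced value should be the sum over ranks.
    t = torch.full((1024,), float(rank + 1), device="cuda")
    dist.all_reduce(t, op=dist.ReduceOp.SUM)
    print(f"rank {rank}/{world}: all_reduce value {t[0].item()} (expected {world * (world + 1) / 2})")
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

If this hangs or the values come out wrong, you get to debug transport, firewall, and NCCL settings on your desk instead of inside a cluster job.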
Nah. Do you have first-hand experience with Strix Halo? At less than 1600€ for a 128GB configuration it manages >45 tokens/s with gpt-oss 120b, which is faster than the DGX Spark at a fraction of the cost.
One thing I can’t find anyone mention in reviews: does inference screech to a halt when using large context windows on models, say in the 100k range on gpt-oss? I’m not concerned about lightning inference speed overall, as I understand the purpose of the Spark is to be a well-rounded trainer/tuner. I just want to know if it becomes unusable, versus a reasonable slowdown, at larger contexts. That’s the thing people are unpleasantly surprised to find about a Mac Studio, which has prevented me from going that route.
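In case anyone with a unit wants to measure it, a rough sketch of the check I have in mind, assuming an OpenAI-compatible local endpoint (Ollama's default port is used here as a placeholder) and a server configured with a large enough context window:

```python
import time
from openai import OpenAI  # pip install openai; works against any OpenAI-compatible server

client = OpenAI(base_url="http://localhost:11434/v1", api_key="unused")

# Pad the prompt toward a very large context; real documents would be more representative.
filler = "lorem ipsum dolor sit amet " * 20_000
t0 = time.time()
first_token_at, chunks = None, 0
stream = client.chat.completions.create(
    model="gpt-oss:120b",
    messages=[{"role": "user", "content": filler + "\n\nSummarize the above in one sentence."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        first_token_at = first_token_at or time.time()
        chunks += 1
total = time.time() - t0
if first_token_at:
    prefill = first_token_at - t0
    print(f"time to first token (prompt processing): {prefill:.1f}s")
    print(f"decode rate: ~{chunks / max(total - prefill, 1e-6):.1f} chunks/s")
```

Time-to-first-token is where large contexts usually hurt, so that number answers the "unusable vs reasonable slowdown" question more directly than the headline tokens/s.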
But please have your LLM post writer be less verbose and repetitive. This is like the stock output from any LLM, where it describes in detail and then summarizes back and forth over multiple useless sections. Please consider a smarter prompt and post-editing…
Since the text is obviously LLM output, how much prompting and editing went into this post? Did you have to correct anything that you put into it that it then got wrong or added incorrect output to?
There are bleeding-edge issues, but everyone dials in transformers first, so that path is generally pain-free.
I haven't exactly bisected the issue but I'm pretty sure convolutions are broken on sm_121 after a certain size, getting 20x memory blowup from a convolution from a 2x batch size increase _only_ on the DGX Spark.
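For reference, a minimal sketch of that kind of check (placeholder layer sizes, not my actual model), assuming plain PyTorch on CUDA:

```python
import torch
import torch.nn as nn

def peak_conv_mem_mib(batch_size: int) -> float:
    """Peak GPU memory for a single convolution forward pass at a given batch size."""
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats()
    conv = nn.Conv2d(64, 64, kernel_size=3, padding=1).cuda()
    x = torch.randn(batch_size, 64, 512, 512, device="cuda")
    with torch.no_grad():
        conv(x)
    torch.cuda.synchronize()
    return torch.cuda.max_memory_allocated() / 2**20

for bs in (8, 16):
    print(f"batch {bs}: peak {peak_conv_mem_mib(bs):.0f} MiB")
# A healthy cuDNN path should scale roughly linearly with batch size,
# not blow up by an order of magnitude.
```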
I haven't had any problems with inference, but I also don't use the transformers library that much.
llama.cpp was working for openai-oss last time I checked and on release, not sure if something broke along the way.
I don't exactly know if memory fragmentation is something fixable on the driver side - this might just be the problem with kernel's policy and GPL, it prevents them from automatically interfering with the memory subsystem to the granularity they'd like - see zfs and their page table antics - or so my thoughts on it is.
If you've done stuff on WSL, you have similar issues, and you can fix it by running a service that periodically compacts and cleans memory; I have it run every hour. Note that this does impact at the very least CPU performance and memory allocation speeds, but I have not had any issues with long training runs with it (24hr+; assuming that is the issue, I have never tried without it, and put that service in place since getting the machine due to my experience on WSL).
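A minimal sketch of that kind of compaction service, assuming root privileges and the standard /proc/sys/vm knobs (run it however you run periodic jobs, e.g. under systemd or cron):

```python
# compact_memory.py - periodically ask the kernel to drop clean caches and compact memory.
import time

def compact_once() -> None:
    with open("/proc/sys/vm/drop_caches", "w") as f:
        f.write("3\n")   # drop page cache, dentries and inodes
    with open("/proc/sys/vm/compact_memory", "w") as f:
        f.write("1\n")   # trigger defragmentation of free memory

if __name__ == "__main__":
    while True:
        compact_once()
        time.sleep(3600)  # hourly, as described above
```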
I'm not yet using mine for ML stuff because there are still a lot of various issues like this post outlined. But I am using mine as an ARM dev system in the meantime, and as a "workstation" it's actually quite good. The Cortex-X925 cores are Zen5 class in performance and it is overall an absolute unit for its size, I'm very impressed that a standard ARM core is pushing this level of performance for a desktop-class machine. I thought about buying a new Linux desktop recently, and this is good enough I might just plug it into a monitor and use it instead.
It is also a standard UEFI+ACPI system; one Reddit user even reported that they were able to boot Fedora 42 and install the open kernel modules with no problem. The overall delta/number of specific patches in the Canonical 6.17-nvidia tree is pretty small when I looked (the current kernel is 6.11). That, and the likelihood that the consumer variant will support Windows, hopefully bodes well for its upstream Linux compatibility.
To be fair, most of this is also true of Strix Halo from what I can tell (most benchmarks put the DGX furthest ahead at prompt processing and a bit ahead at raw token output, but the software is still buggy and Blackwell is still a bumpy ride overall, so it might get better). But I think it's mostly the pricing that is holding it back. I'm curious what the consumer variant will be priced at.
There weren't any instructions for how the author got ollama/llama.cpp; could it possibly be something Nvidia shipped with the DGX Spark that is an old version?
Theoretically it has slightly better memory bandwidth, (you are supposed to get) the Nvidia AI software ecosystem support out of the box, and you can use the 200G NIC to stick 2 together more efficiently.
Practically, if the goal is 100% about AI and cloud isn't an option for some reason, both options are likely "a great way to waste a couple grand trying to save a couple grand" as you'd get 7x the performance and likely still feel it's a bit slow on larger models using an RTX Pro 6000. I say this as a Ryzen AI Max+ 395 owner, though I got mine because it's the closest thing to an x86 Apple Silicon laptop one can get at the moment.
Because the ML ecosystem is more mature on the Nvidia side. Software-wise the CUDA platform is more advanced. It will be hard for AMD to catch up. It is good to see competition though.
The PyTorch 2.9 wheels do work. You can pip install torch --index-url <whatever-it-is> and it just works. You do need to build flash attention from source, which takes an hour or so.
Am I reading this right? I was expecting much more performance. My 64G M1 Max has 40.72 tok/s on ollama/GPT-OSS-20B (less than half the price of this machine), and M4 Max 128G from a colleague (but 32G would work) gets about 67 tok/s on ollama/GPT-OSS-20B, and apparently the most recent software updates push that to 78 tok/s. The DGX Spark gets 82.74 tok/s.
I'm utterly shocked at the article saying GPU inference (PyTorch/Transformers) isn't working: "numerical instability produces bad outputs", "not viable for real-time serving", "wait for driver/CUDA updates"!
My job just got me and our entire team a DGX Spark.
I'm impressed at the ease of use for ollama models I couldn't run on my laptop.
gpt-oss:120b is shockingly better than what I thought it would be from running the 20b model on my laptop.
The DGX has changed my mind about the future being small specialized models.
Totally agree. I’ve been training nanochat models all morning. Hit some speed bumps. I’ll share more later in another article. But it’s absolutely amazing. I fine-tuned a Gemma 3 model in a day yesterday.
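For anyone curious what that looks like in practice, a minimal sketch of a LoRA-style fine-tuning setup, assuming transformers + peft in BF16; the model id and target modules are placeholders, not necessarily what I used:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

MODEL = "google/gemma-3-4b-it"  # placeholder; pick the Gemma 3 variant you actually fine-tune

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.bfloat16).to("cuda")

# Attach small trainable LoRA adapters instead of updating all of the base weights.
lora = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")
model = get_peft_model(model, lora)
model.print_trainable_parameters()

# One training step on a toy batch, just to confirm the BF16 GPU path works end to end.
batch = tok(["DGX Spark fine-tuning smoke test."], return_tensors="pt").to("cuda")
out = model(**batch, labels=batch["input_ids"])
out.loss.backward()
print("loss:", out.loss.item())
```

A real run would wrap this in a Trainer or your own loop with an optimizer and a dataset; this is just the smoke-test version.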
Nvidia products, including those from the GPU/CUDA libraries world, the NICs and the switches, frequently tend to feel like MVPs. They work in some cases, hopefully in the end, but they are far from polished products without rough edges.
So, it seems like this makes the DGX a viable ARM-based workstation, for those of us who need/want such a thing, while also offering a relatively decent AI/ML environment.
Two things need to happen for me to get excited about this:
1. It stimulates other manufacturers into building their own DGX-class workstations.
2. This all eventually gets shipped in a decent laptop product.
As much as it pains me, until that happens, it still seems like Apple Silicon is the more viable option, if not the most ethical.
It does: the inference speed is much slower than a consumer video card. The draw of the Spark and systems like it is the massive amount of memory available to the GPU.
The memory bandwidth on this thing is absolute trash; better to buy a Mac Mini/Studio with this much RAM if you're throwing around this much money, it'll be faster (M4 Max).
pertymcpert|4 months ago
> ARM64 Architecture: Not x86_64 (limited ML ecosystem maturity). No PyTorch wheels for ARM64+CUDA (must use Docker). Most ML tools optimized for x86.
No evidence for any of this whatsoever. The author just asked Claude/claude code to write their article and it just plain hallucinated some rubbish.
veber-alex|4 months ago
There are official benchmarks of the Spark running multiple models just fine on llama.cpp
https://github.com/ggml-org/llama.cpp/discussions/16578
spwa4|4 months ago
Ryzen Max 395+ gets you 55 tok/s [1]
[1] https://www.reddit.com/r/LocalLLaMA/comments/1nabcek/anyone_...
jasonjmcghee|4 months ago
Are you shocked because that isn't your experience?
From the article it sounds like ollama runs CPU inference, not GPU inference. Is that the case for you?
renaudr|4 months ago
https://cookbook.openai.com/articles/gpt-oss/run-nvidia
fxtentacle|4 months ago
Really? Less RAM bw than an Epyc CPU? And 4x to 8x less than a consumer GPU?
How come this doesn’t massively limit LLM inference speeds?
suprjami|4 months ago
Wow. Where do I sign up?