Author here. I've updated the article based on your feedback. Thank you.
Key corrections:
Ollama GPU usage - I was wrong. It IS using GPU (verified 96% utilization). My "CPU-optimized backend" claim was incorrect.
FP16 vs BF16 - enum caught the critical gap: I trained with BF16, tested inference with FP16 (broken), but never tested BF16 inference. "GPU inference fundamentally broken" was overclaimed. Should be "FP16 has issues, BF16 untested (likely works)." A quick way to check is sketched at the end of this comment.
llama.cpp - veber-alex's official benchmark link proves it works. My issues were likely version-specific, not representative.
ARM64+CUDA maturity - bradfa was right about Jetson history. ARM64+CUDA is mature. The new combination is Blackwell+ARM64, not ARM64+CUDA itself.
The HN community caught my incomplete testing, overclaimed conclusions, and factual errors.
Ship early, iterate publicly, accept criticism gracefully.
Thanks especially to enum, veber-alex, bradfa, furyofantares, stuckinhell, jasonjmcghee, eadwu, and renaudr. The article is significantly better now.
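For anyone who wants to close that BF16 gap themselves, here's a minimal sketch of the kind of check I mean, assuming PyTorch + transformers; the model id is a placeholder, not necessarily what the article used:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "google/gemma-3-4b-it"  # placeholder model id
PROMPT = "Briefly explain what unified memory means on the DGX Spark."

tok = AutoTokenizer.from_pretrained(MODEL)
for dtype in (torch.float16, torch.bfloat16):
    # Load the same checkpoint once per dtype and compare the generations.
    model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=dtype).to("cuda")
    inputs = tok(PROMPT, return_tensors="pt").to("cuda")
    out = model.generate(**inputs, max_new_tokens=64, do_sample=False)
    print(f"--- {dtype} ---")
    print(tok.decode(out[0], skip_special_tokens=True))
    del model
    torch.cuda.empty_cache()
```

If FP16 really is the problem, the first run should produce visibly degraded text while the BF16 run stays coherent.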
Is there a reason why you used an LLM for the entire article, and moreover, even for this comment? Couldn't you have at least written this comment yourself?
Is that version correct?
Asking because (in Ollama terms) it's positively ancient, 0.12.6 being the most recent release (currently).
I'm guessing it _might_ make a difference, as the Ollama crowd do seem to be changing things, adding new features and optimisations (etc) quite often.
For example, that 0.12.6 version is where initial experimental support for Vulkan (i.e. Intel Xe GPUs) was added, and in my testing that worked. Not that Vulkan support would do anything in your case. ;)
Late to the party here, but you should definitely be using PyTorch 25.09 (or whatever is latest when you go to check) rather than 24.10. That's a year-old PyTorch on new hardware; I suspect a lot of these bugs have been fixed.
One of my colleagues wrote a first impressions blog post last week: https://www.anaconda.com/blog/python-nvidia-dgx-spark-first-...
It's from our company's perspective, but it's a solid overview of the product and its intended capabilities, from the POV of an AI developer or data scientist.
> There you’ll see the 10 Cortex-X925 (“performance”) cores listed with a peak clock rate of 4 GHz, along with the 10 Cortex-A725 (“efficiency”) cores listed with a peak clock rate of 2.8 GHz
> If you start Python and ask it how many CPU cores you have, it will count both kinds of cores and report 20
> Note that because of the speed difference between the cores, you will want to ensure there is some form of dynamic scheduling in your application that can load balance between the different core types.
Sounds like a new type of hell where I now not only need to manage the threads themselves, but also take into account what type of core they run on, while Python straight up reports them as the same.
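If you want to see this for yourself, here's a minimal sketch, assuming Linux exposes the usual cpufreq sysfs entries on this board:

```python
import os
from pathlib import Path

print("os.cpu_count():", os.cpu_count())  # counts both core types, so it reports 20

# Group cores by their advertised max clock to tell the X925 and A725 clusters apart.
clusters = {}
for cpu in sorted(Path("/sys/devices/system/cpu").glob("cpu[0-9]*")):
    freq_file = cpu / "cpufreq" / "cpuinfo_max_freq"
    if freq_file.exists():
        clusters.setdefault(int(freq_file.read_text()), []).append(cpu.name)

for khz, cpus in sorted(clusters.items(), reverse=True):
    print(f"{khz / 1e6:.1f} GHz max: {len(cpus)} cores")
```

Anything that sizes its own thread pool only sees the flat count of 20, which is exactly the scheduling problem described above.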
> The CPU memory is the same as the GPU memory and is much larger than any other discrete GPU available in a desktop. That means much larger datasets and bigger models can be run locally than would be possible otherwise.
Isn't this the same architecture that Apple's M-series implements, from a memory perspective?
I absolutely love it. I’ve been up for days playing with it. But there are some bleeding-edge issues. I tried to write a balanced article. I would highly recommend it for people who love to get their hands dirty. It blows away any consumer GPU.
I have H100s to myself, and access to more GPUs than I know what to do with in national clusters.
The Spark is much more fun. And I’m more productive. With two of them, you can debug shallow NCCL/MPI problems before hitting a real cluster. I sincerely love Slurm, but there’s nothing like a personal computer.
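As a concrete example of the kind of shallow problem you can shake out locally, here is a minimal two-node NCCL smoke test sketch, assuming you launch it with torchrun on both Sparks; the hostnames and ports are placeholders:

```python
# nccl_check.py - run on both machines, e.g.:
#   torchrun --nnodes=2 --nproc_per_node=1 --node_rank=<0 or 1> \
#            --master_addr=<first-spark-hostname> --master_port=29500 nccl_check.py
import os
import torch
import torch.distributed as dist

def main():
    dist.init_process_group(backend="nccl")  # torchrun provides the rank/world-size env vars
    torch.cuda.set_device(int(os.environ.get("LOCAL_RANK", 0)))
    rank, world = dist.get_rank(), dist.get_world_size()

    # Each rank contributes (rank + 1); the all-reduced value should be the sum over ranks.
    t = torch.full((1024,), float(rank + 1), device="cuda")
    dist.all_reduce(t, op=dist.ReduceOp.SUM)
    print(f"rank {rank}/{world}: all_reduce value {t[0].item()} (expected {world * (world + 1) / 2})")
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

If this hangs or the values come out wrong, you get to debug transport, firewall, and NCCL settings on your desk instead of inside a cluster job.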
Nah. Do you have first-hand experience with Strix Halo? At less than 1600€ for a 128GB configuration it manages >45 tokens/s with gpt-oss 120b, which is faster than the DGX Spark at a fraction of the cost.
One thing I can’t find anyone mention in reviews: does inference screech to a halt when using large context windows on models, say in the 100k range on gpt-oss? I’m not concerned about lightning inference speed overall, as I understand the purpose of the Spark is to be a well-rounded trainer/tuner. I just want to know if it becomes unusable, versus a reasonable slowdown, at larger contexts. That’s the thing people are unpleasantly surprised to find about a Mac Studio, which has prevented me from going that route.
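In case anyone with a unit wants to measure it, a rough sketch of the check I have in mind, assuming an OpenAI-compatible local endpoint (Ollama's default port is used here as a placeholder) and a server configured with a large enough context window:

```python
import time
from openai import OpenAI  # pip install openai; works against any OpenAI-compatible server

client = OpenAI(base_url="http://localhost:11434/v1", api_key="unused")

# Pad the prompt toward a very large context; real documents would be more representative.
filler = "lorem ipsum dolor sit amet " * 20_000
t0 = time.time()
first_token_at, chunks = None, 0
stream = client.chat.completions.create(
    model="gpt-oss:120b",
    messages=[{"role": "user", "content": filler + "\n\nSummarize the above in one sentence."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        first_token_at = first_token_at or time.time()
        chunks += 1
total = time.time() - t0
if first_token_at:
    prefill = first_token_at - t0
    print(f"time to first token (prompt processing): {prefill:.1f}s")
    print(f"decode rate: ~{chunks / max(total - prefill, 1e-6):.1f} chunks/s")
```

Time-to-first-token is where large contexts usually hurt, so that number answers the "unusable vs reasonable slowdown" question more directly than the headline tokens/s.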
But please have your LLM post writer be less verbose and repetitive. This is like the stock output from any LLM, where it describes in detail and then summarizes back and forth over multiple useless sections. Please consider a smarter prompt and post-editing…
Since the text is obviously LLM output, how much prompting and editing went into this post? Did you have to correct anything that you put into it that it then got wrong or added incorrect output to?
There are bleeding-edge issues, but everyone dials in transformers first, so that path is generally pain-free.
I haven't exactly bisected the issue but I'm pretty sure convolutions are broken on sm_121 after a certain size, getting 20x memory blowup from a convolution from a 2x batch size increase _only_ on the DGX Spark.
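For reference, a minimal sketch of that kind of check (placeholder layer sizes, not my actual model), assuming plain PyTorch on CUDA:

```python
import torch
import torch.nn as nn

def peak_conv_mem_mib(batch_size: int) -> float:
    """Peak GPU memory for a single convolution forward pass at a given batch size."""
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats()
    conv = nn.Conv2d(64, 64, kernel_size=3, padding=1).cuda()
    x = torch.randn(batch_size, 64, 512, 512, device="cuda")
    with torch.no_grad():
        conv(x)
    torch.cuda.synchronize()
    return torch.cuda.max_memory_allocated() / 2**20

for bs in (8, 16):
    print(f"batch {bs}: peak {peak_conv_mem_mib(bs):.0f} MiB")
# A healthy cuDNN path should scale roughly linearly with batch size,
# not blow up by an order of magnitude.
```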
I haven't had any problems with inference, but I also don't use the transformers library that much.
llama.cpp was working for openai-oss last time I checked and on release, not sure if something broke along the way.
I don't exactly know if memory fragmentation is something fixable on the driver side - this might just be the problem with kernel's policy and GPL, it prevents them from automatically interfering with the memory subsystem to the granularity they'd like - see zfs and their page table antics - or so my thoughts on it is.
If you've done stuff on WSL, you have similar issues, and you can fix it by running a service that periodically compacts and cleans memory; I have it run every hour. Note that this does impact at the very least CPU performance and memory allocation speeds, but I have not had any issues with long training runs with it (24hr+; assuming that is the issue, I have never tried without it, and put that service in place since getting the machine due to my experience on WSL).
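A minimal sketch of that kind of compaction service, assuming root privileges and the standard /proc/sys/vm knobs (run it however you run periodic jobs, e.g. under systemd or cron):

```python
# compact_memory.py - periodically ask the kernel to drop clean caches and compact memory.
import time

def compact_once() -> None:
    with open("/proc/sys/vm/drop_caches", "w") as f:
        f.write("3\n")   # drop page cache, dentries and inodes
    with open("/proc/sys/vm/compact_memory", "w") as f:
        f.write("1\n")   # trigger defragmentation of free memory

if __name__ == "__main__":
    while True:
        compact_once()
        time.sleep(3600)  # hourly, as described above
```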
I'm not yet using mine for ML stuff because there are still a lot of various issues like this post outlined. But I am using mine as an ARM dev system in the meantime, and as a "workstation" it's actually quite good. The Cortex-X925 cores are Zen5 class in performance and it is overall an absolute unit for its size, I'm very impressed that a standard ARM core is pushing this level of performance for a desktop-class machine. I thought about buying a new Linux desktop recently, and this is good enough I might just plug it into a monitor and use it instead.
It is also a standard UEFI+ACPI system; one Reddit user even reported that they were able to boot Fedora 42 and install the open kernel modules with no problem. The overall delta/number of specific patches in the Canonical 6.17-nvidia tree is pretty small when I looked (the current kernel is 6.11). That, and the likelihood that the consumer variant will support Windows, hopefully bodes well for its upstream Linux compatibility.
To be fair, most of this is also true of Strix Halo from what I can tell (most benchmarks put the DGX furthest ahead at prompt processing and a bit ahead at raw token output, but the software is still buggy and Blackwell is still a bumpy ride overall, so it might get better). But I think it's mostly the pricing that is holding it back. I'm curious what the consumer variant will be priced at.
There weren't any instructions for how the author got ollama/llama.cpp; could it possibly be something Nvidia shipped with the DGX Spark that is an old version?
Theoretically it has slightly better memory bandwidth, (you are supposed to get) the Nvidia AI software ecosystem support out of the box, and you can use the 200G NIC to stick 2 together more efficiently.
Practically, if the goal is 100% about AI and cloud isn't an option for some reason, both options are likely "a great way to waste a couple grand trying to save a couple grand" as you'd get 7x the performance and likely still feel it's a bit slow on larger models using an RTX Pro 6000. I say this as a Ryzen AI Max+ 395 owner, though I got mine because it's the closest thing to an x86 Apple Silicon laptop one can get at the moment.
Because the ML ecosystem is more mature on the Nvidia side. Software-wise the CUDA platform is more advanced. It will be hard for AMD to catch up. It is good to see competition though.
The PyTorch 2.9 wheels do work. You can pip install torch --index-url <whatever-it-is> and it just works. You do need to build flash attention from source, which takes an hour or so.
Am I reading this right? I was expecting much more performance. My 64G M1 Max has 40.72 tok/s on ollama/GPT-OSS-20B (less than half the price of this machine), and M4 Max 128G from a colleague (but 32G would work) gets about 67 tok/s on ollama/GPT-OSS-20B, and apparently the most recent software updates push that to 78 tok/s. The DGX Spark gets 82.74 tok/s.
I'm utterly shocked at the article saying GPU inference (PyTorch/Transformers) isn't working: "numerical instability produces bad outputs", "not viable for real-time serving", "wait for driver/CUDA updates"!
My job just got me and our entire team a DGX Spark.
I'm impressed at the ease of use for ollama models I couldn't run on my laptop.
gpt-oss:120b is shockingly better than what I thought it would be from running the 20b model on my laptop.
The DGX has changed my mind about the future being small specialized models.
Totally agree. I’ve been training nanochat models all morning. Hit some speed bumps. I’ll share more later in another article. But it’s absolutely amazing. I fine-tuned a Gemma 3 model in a day yesterday.
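For anyone curious what that looks like in practice, a minimal sketch of a LoRA-style fine-tuning setup, assuming transformers + peft in BF16; the model id and target modules are placeholders, not necessarily what I used:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

MODEL = "google/gemma-3-4b-it"  # placeholder; pick the Gemma 3 variant you actually fine-tune

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.bfloat16).to("cuda")

# Attach small trainable LoRA adapters instead of updating all of the base weights.
lora = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")
model = get_peft_model(model, lora)
model.print_trainable_parameters()

# One training step on a toy batch, just to confirm the BF16 GPU path works end to end.
batch = tok(["DGX Spark fine-tuning smoke test."], return_tensors="pt").to("cuda")
out = model(**batch, labels=batch["input_ids"])
out.loss.backward()
print("loss:", out.loss.item())
```

A real run would wrap this in a Trainer or your own loop with an optimizer and a dataset; this is just the smoke-test version.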
Nvidia products, including those from the GPU/CUDA libraries world, the NICs and the switches, frequently tend to feel like MVPs. They work in some cases, hopefully in the end, but they are far from polished products without rough edges.
So, it seems like this makes the DGX a viable ARM-based workstation, for those of us who need/want such a thing, while also offering a relatively decent AI/ML environment.
Two things need to happen for me to get excited about this:
1. It stimulates other manufacturers into building their own DGX-class workstations.
2. This all eventually gets shipped in a decent laptop product.
As much as it pains me, until that happens, it still seems like Apple Silicon is the more viable option, if not the most ethical.
It does: the inference speed is much slower than a consumer video card. The draw of the Spark and systems like it is the massive amount of memory available to the GPU.
The memory bandwidth on this thing is absolute trash; better to buy a Mac Mini/Studio with this much RAM if you're throwing around this much money, it'll be faster (M4 Max).
pertymcpert|4 months ago
> ARM64 Architecture: Not x86_64 (limited ML ecosystem maturity). No PyTorch wheels for ARM64+CUDA (must use Docker). Most ML tools optimized for x86.
No evidence for any of this whatsoever. The author just asked Claude/claude code to write their article and it just plain hallucinated some rubbish.
veber-alex|4 months ago
There are official benchmarks of the Spark running multiple models just fine on llama.cpp
https://github.com/ggml-org/llama.cpp/discussions/16578
spwa4|4 months ago
Ryzen Max 395+ gets you 55 tok/s [1]
[1] https://www.reddit.com/r/LocalLLaMA/comments/1nabcek/anyone_...
jasonjmcghee|4 months ago
Are you shocked because that isn't your experience?
From the article it sounds like ollama runs CPU inference, not GPU inference. Is that the case for you?
renaudr|4 months ago
https://cookbook.openai.com/articles/gpt-oss/run-nvidia
fxtentacle|4 months ago
Really? Less RAM bw than an Epyc CPU? And 4x to 8x less than a consumer GPU?
How come this doesn’t massively limit LLM inference speeds?
suprjami|4 months ago
Wow. Where do I sign up?