RyeCatcher | 4 months ago
Key corrections:
Ollama GPU usage - I was wrong. It IS using the GPU (verified 96% utilization). My "CPU-optimized backend" claim was incorrect.
FP16 vs BF16 - enum caught a critical gap: I trained with BF16 and tested inference with FP16 (broken), but never tested BF16 inference. "GPU inference fundamentally broken" was an overclaim. It should be: "FP16 has issues; BF16 untested (likely works)."
llama.cpp - veber-alex's official benchmark link proves it works. My issues were likely version-specific, not representative.
ARM64+CUDA maturity - bradfa was right about Jetson history. ARM64+CUDA is mature. The new combination is Blackwell+ARM64, not ARM64+CUDA itself.
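The FP16 failure mode in the corrections above has a simple numeric explanation: BF16 shares float32's 8-bit exponent, while IEEE FP16 has only 5 exponent bits and overflows above 65504. Values that a BF16-trained model produces comfortably can become inf under FP16, which is consistent with "FP16 has issues, BF16 likely works." A minimal NumPy sketch (NumPy has no native bfloat16, so the BF16 side is emulated here by truncating a float32 to its top 16 bits, which preserves the exponent range):

```python
import numpy as np

# FP16 (IEEE half) has 5 exponent bits: max finite value is 65504.
# BF16 shares float32's 8 exponent bits: max finite value is ~3.4e38.
x = np.array([70000.0], dtype=np.float32)  # fine in BF16, out of FP16 range

as_fp16 = x.astype(np.float16)  # overflows to inf

# Emulate BF16 by keeping only the top 16 bits of the float32 pattern
# (same sign/exponent as float32, 7-bit mantissa, round-toward-zero).
as_bf16 = (x.view(np.uint32) & np.uint32(0xFFFF0000)).view(np.float32)

print(as_fp16[0])  # inf     -- FP16 silently corrupts the value
print(as_bf16[0])  # 69632.0 -- BF16 keeps it finite, just less precise
```

This is an illustration of the dtype ranges only, not of the author's model or inference stack; real BF16 hardware also rounds to nearest rather than truncating.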
The HN community caught my incomplete testing, overclaimed conclusions, and factual errors.
The lesson: ship early, iterate publicly, and accept criticism gracefully.
Thanks especially to enum, veber-alex, bradfa, furyofantares, stuckinhell, jasonjmcghee, eadwu, and renaudr. The article is significantly better now.
CamperBob2 | 4 months ago
But if that's not the case, then yeah, it's a crappy practice and I'd hate to see it spread any further than it already has.
justinclift | 4 months ago
Is that version correct?
Asking because (in Ollama terms) it's positively ancient; 0.12.6 is the most recent release (currently).
I'm guessing it _might_ make a difference, as the Ollama crowd seem to change things, add new features, and optimise quite often.
For example, that 0.12.6 version is where initial experimental support for Vulkan (i.e. Intel Xe GPUs) was added, and in my testing it worked. Not that Vulkan support would do anything in your case. ;)