top | item 45169945

(no title)

slacka | 5 months ago

I too found that interesting that Apple's Neural Engine doesn't work with local LLMs. Seems like Apple, AMD, and Intel are missing the AI boat by not properly supporting their NPUs in llama.cpp. Any thoughts on why this is?

discuss

order

bigyabai|5 months ago

Most NPUs are almost universally too weak to use for serious LLM inference. Most of the time you get better performance-per-watt out of GPU compute shaders, the majority of NPUs are dark silicon.

Keep in mind - Nvidia has no NPU hardware because that functionality is baked-into their GPU architecture. AMD, Apple and Intel are all in this awkward NPU boat because they wanted to avoid competition with Nvidia and continue shipping simple raster designs.

aurareturn|5 months ago

Apple is in this NPU boat because they are optimized for mobile first.

Nvidia does not optimize for mobile first.

AMD and Intel were forced by Microsoft to add NPUs in order to sell “AI PCs”. Turns out the kind of AI that people want to run locally can’t run on an NPU. It’s too weak like you said.

AMD and Intel both have matmul acceleration directly in their GPUs. Only Apple does not.

numpad0|5 months ago

Perhaps due to sizes? AI/NN models before LLM were magnitudes smaller, as evident in effectively all LLMs carrying "Large" in its name regardless of relative size differences.

Someone|5 months ago

I guess that hardware doesn’t make things faster (¿yet?). If so I guess they would have mentioned it in https://machinelearning.apple.com/research/core-ml-on-device.... That is updated for Sequoia and says

“This technical post details how to optimize and deploy an LLM to Apple silicon, achieving the performance required for real time use cases. In this example we use Llama-3.1-8B-Instruct, a popular mid-size LLM, and we show how using Apple’s Core ML framework and the optimizations described here, this model can be run locally on a Mac with M1 Max with about ~33 tokens/s decoding speed. While this post focuses on a particular Llama model, the principles outlined here apply generally to other transformer-based LLMs of different sizes.”

cma|5 months ago

If it uses a lot less power it could still be a win for some use cases, like while on battery you might still want to run transformer based speech to text, RTX voice-like microphone denoising, image generation/infill in photo editing programs. In some use cases like RTX-voice like stuff during multiplayer gaming, you might want the GPU free to run the game even if it still suffers some memory bandwidth impact from having it running.

GeekyBear|5 months ago

There is no NPU "standard".

Llama.cpp would have to target every hardware vendor's NPU individually and those NPUs tend to have breaking changes when newer generations of hardware are released.

Even Nvidia GPUs often have breaking changes moving from one generation to the next.

montebicyclelo|5 months ago

I think OP is suggesting that Apple / AMD / Intel do the work of integrating their NPUs into popular libraries like `llama.cpp`. Which might make sense. My impression is that by the time the vendors support a certain model with their NPUs the model is too old and nobody cares anyway. Whereas llama.cpp keeps up with the latest and greatest.

svachalek|5 months ago

I think I saw something that got Ollama to run models on it? But it only works with tiny models. Seems like the neural engine is extremely power efficient but not fast enough to do LLMs with billions of parameters.

reddit_clone|5 months ago

I am running Ollama with 'SimonPu/Qwen3-Coder:30B-Instruct_Q4_K_XL' on a M4 pro MBP with 48 GB of memory.

From Emacs/gptel, it seems pretty fast.

I have never used the proper hosted LLMS, so I don't have a direct comparison. But the above LLM answered coding questions in a handful of seconds.

The cost of memory (and disk) upgrades in apple machines is exorbitant.