item 39866974

declaredapple | 1 year ago

They've been designing their own chips for a while now, including with an NPU.

Also, because of their unified memory design, they have insane memory bandwidth, which is incredibly useful for LLMs. IMO they may have a head start in that respect for on-device inference of large models (e.g. 1B+ params).
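A rough sketch of why bandwidth matters here: during autoregressive decoding, every weight is read roughly once per generated token, so peak tokens/sec is bounded by memory bandwidth divided by model size in bytes. The numbers below are illustrative assumptions (400 GB/s is in the ballpark of Apple's higher-end unified-memory parts), not measurements:

```python
def est_tokens_per_sec(bandwidth_gb_s: float, params_billions: float, bytes_per_param: float) -> float:
    """Bandwidth-bound upper estimate: one full pass over the weights per token."""
    model_bytes = params_billions * 1e9 * bytes_per_param
    return bandwidth_gb_s * 1e9 / model_bytes

# Hypothetical: a 7B model quantized to 4 bits (0.5 bytes/param) on ~400 GB/s unified memory
print(est_tokens_per_sec(400, 7, 0.5))  # ~114 tokens/sec as a ceiling
```

Real throughput lands well below this ceiling (compute, KV-cache reads, and scheduling all eat into it), but the bandwidth term is why high-bandwidth unified memory helps so much.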

talldayo | 1 year ago

I don't think people are running 1B+ models on the Neural Engine these days. The high-performance models I've seen all rely on Metal Performance Shaders, which scales with how powerful your GPU is. It's not terribly slow on iPhone, but I think some people get the wrong idea and associate an ambient processor like the Neural Engine with LLMs.

The bigger bottleneck seems like memory, to me. iPhones have traditionally skimped on RAM more so than even cheap and midrange Android counterparts. I can imagine running an LLM in the background on my S10 - it's a bit harder to envision iOS swapping everything smoothly on a similarly-aged iPhone.

JKCalhoun | 1 year ago

Sure, but we're discussing 1.58-bit models, which (again, I'm a layman) I assume are roughly an order of magnitude smaller in memory overhead.
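For a sense of scale, a quick back-of-envelope (assumed numbers; 1.58 bits/weight corresponds to ternary weights, log2(3) ≈ 1.58, as in BitNet b1.58):

```python
def model_size_gb(params_billions: float, bits_per_param: float) -> float:
    """Weight storage only; ignores KV cache and activations."""
    return params_billions * 1e9 * bits_per_param / 8 / 1e9

fp16 = model_size_gb(1, 16)       # 2.0 GB for a 1B-param model at fp16
ternary = model_size_gb(1, 1.58)  # ~0.2 GB at 1.58 bits/weight
print(fp16 / ternary)             # ~10x smaller, i.e. about an order of magnitude
```

So relative to fp16 the "order of magnitude" intuition checks out, though versus the 4-bit quantization most phones already use, the saving is closer to 2.5x.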