A decent chunk of AI performance comes down to doing matrix multiplication fast. Part of that is reducing the amount of data transferred to and from the matrix-multiplication hardware on the NPU or GPU, since memory bandwidth is a significant bottleneck. The article is highlighting the use of 4-bit formats.
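To make the bandwidth saving concrete, here is a minimal sketch of symmetric 4-bit weight quantization. This is an illustrative toy, not any particular library's scheme: real kernels use per-group scales and pack two 4-bit values per byte, but the idea is the same — store a small integer plus a scale instead of a full fp32 value.

```python
import numpy as np

def quantize_int4(w):
    # Symmetric per-tensor quantization to the int4 range [-8, 7].
    # One fp32 scale maps the largest-magnitude weight to +/-7.
    scale = np.max(np.abs(w)) / 7.0
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize_int4(q, scale):
    # Recover approximate fp32 weights at compute time.
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 4)).astype(np.float32)
q, s = quantize_int4(w)
w_hat = dequantize_int4(q, s)
# Packed 4-bit storage (two values per byte) is 8x smaller than fp32,
# so 8x less data moves to and from the matmul unit.
```

The quantization error is bounded by half the scale per weight, which is why small formats work far better for weights (a fixed, analyzable distribution) than for arbitrary data.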
GPUs are an evolving target. New GPUs have tensor cores and support all kinds of interesting numeric formats; older GPUs don't support any of the formats that AI workloads are using today (e.g. BF16, int4, all the various smaller FP types).
An NPU will be more efficient because it is much less general than a GPU and doesn't spend any gates on graphics. However, it is also fairly restricted. Cloud hardware is orders of magnitude faster (due to much higher compute resources and I/O bandwidth), e.g. https://cloud.google.com/tpu/docs/v6e.
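The format-support point is easy to illustrate with BF16: it is just fp32 with the low 16 mantissa bits dropped, keeping the full fp32 exponent range, which is why newer tensor cores can support it cheaply. A minimal sketch using truncation (real hardware typically rounds to nearest even instead):

```python
import numpy as np

def fp32_to_bf16_trunc(x):
    # Reinterpret fp32 as raw bits, zero the low 16 mantissa bits,
    # and reinterpret back. Same sign and exponent, 7 mantissa bits left.
    bits = x.astype(np.float32).view(np.uint32)
    return (bits & 0xFFFF0000).view(np.float32)

x = np.array([3.14159265], dtype=np.float32)
xb = fp32_to_bf16_trunc(x)
# xb[0] == 3.140625: pi survives to about 3 significant digits,
# which is plenty for neural-net weights at half the memory traffic.
```

Older GPUs lack the datapaths to operate on these narrow formats natively, so they would have to widen everything back to fp32 and lose the bandwidth advantage.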
Reminder: DeepSeek distilled models are better thought of as fine-tunes of Qwen/Llama using DeepSeek output, and are not the same as actual DeepSeek v3 or R1.
This unfortunate naming has sown plenty of confusion around DeepSeek's quality and resource requirements. Actual DeepSeek v3/R1 continues to require at least ~100GB of VRAM/Mem/SSD, and this does not change that.
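A rough back-of-envelope formula shows why the full model is so demanding. This sketch counts weight storage only (no KV cache or activations, which add more on top), and assumes the widely reported ~671B total parameter count for DeepSeek v3/R1:

```python
def model_mem_gb(n_params, bits_per_param):
    # Weight-memory footprint in decimal GB: params * bits / 8 bytes per param.
    # Ignores KV cache, activations, and runtime overhead.
    return n_params * bits_per_param / 8 / 1e9

# A 70B-parameter distill at 4 bits fits in ~35 GB:
small = model_mem_gb(70e9, 4)    # -> 35.0
# Full ~671B-parameter DeepSeek v3/R1 at 4 bits needs ~335 GB of weights,
# which is why even aggressive sub-4-bit quants still land north of 100 GB:
full = model_mem_gb(671e9, 4)    # -> 335.5
```

That gap — tens of GB versus hundreds — is exactly the difference the distill naming papers over.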
jokowueu|1 year ago
tamlin|1 year ago
RandomBK|1 year ago
bestouff|1 year ago
darthrupert|1 year ago