A decent chunk of AI performance comes down to doing matrix multiplication fast. Part of that is reducing the amount of data transferred to and from the matrix-multiplication hardware on the NPU or GPU, since memory bandwidth is a significant bottleneck. The article is highlighting the use of 4-bit formats.
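To make the bandwidth saving concrete, here is a minimal sketch of symmetric 4-bit weight quantization. This is an illustrative toy, not any particular library's scheme: real kernels use per-group scales and pack two 4-bit values per byte, but the idea is the same — store a small integer plus a scale instead of a full fp32 value.

```python
import numpy as np

def quantize_int4(w):
    # Symmetric per-tensor quantization to the int4 range [-8, 7].
    # One fp32 scale maps the largest-magnitude weight to +/-7.
    scale = np.max(np.abs(w)) / 7.0
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize_int4(q, scale):
    # Recover approximate fp32 weights at compute time.
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 4)).astype(np.float32)
q, s = quantize_int4(w)
w_hat = dequantize_int4(q, s)
# Packed 4-bit storage (two values per byte) is 8x smaller than fp32,
# so 8x less data moves to and from the matmul unit.
```

The quantization error is bounded by half the scale per weight, which is why small formats work far better for weights (a fixed, analyzable distribution) than for arbitrary data.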
GPUs are an evolving target. New GPUs have tensor cores and support all kinds of interesting numeric formats; older GPUs don't support any of the formats that AI workloads are using today (e.g. BF16, int4, all the various smaller FP types).
An NPU will be more efficient because it is much less general than a GPU and doesn't spend any gates on graphics. However, it is also fairly restricted. Cloud hardware is orders of magnitude faster (due to much higher compute resources and I/O bandwidth), e.g. https://cloud.google.com/tpu/docs/v6e.
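The format-support point is easy to illustrate with BF16: it is just fp32 with the low 16 mantissa bits dropped, keeping the full fp32 exponent range, which is why newer tensor cores can support it cheaply. A minimal sketch using truncation (real hardware typically rounds to nearest even instead):

```python
import numpy as np

def fp32_to_bf16_trunc(x):
    # Reinterpret fp32 as raw bits, zero the low 16 mantissa bits,
    # and reinterpret back. Same sign and exponent, 7 mantissa bits left.
    bits = x.astype(np.float32).view(np.uint32)
    return (bits & 0xFFFF0000).view(np.float32)

x = np.array([3.14159265], dtype=np.float32)
xb = fp32_to_bf16_trunc(x)
# xb[0] == 3.140625: pi survives to about 3 significant digits,
# which is plenty for neural-net weights at half the memory traffic.
```

Older GPUs lack the datapaths to operate on these narrow formats natively, so they would have to widen everything back to fp32 and lose the bandwidth advantage.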
Reminder: DeepSeek distilled models are better thought of as fine-tunes of Qwen/Llama using DeepSeek output, and are not the same as actual DeepSeek v3 or R1.
This unfortunate naming has sown plenty of confusion around DeepSeek's quality and resource requirements. Actual DeepSeek v3/R1 continues to require at least ~100GB of VRAM/Mem/SSD, and this does not change that.
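A rough back-of-envelope formula shows why the full model is so demanding. This sketch counts weight storage only (no KV cache or activations, which add more on top), and assumes the widely reported ~671B total parameter count for DeepSeek v3/R1:

```python
def model_mem_gb(n_params, bits_per_param):
    # Weight-memory footprint in decimal GB: params * bits / 8 bytes per param.
    # Ignores KV cache, activations, and runtime overhead.
    return n_params * bits_per_param / 8 / 1e9

# A 70B-parameter distill at 4 bits fits in ~35 GB:
small = model_mem_gb(70e9, 4)    # -> 35.0
# Full ~671B-parameter DeepSeek v3/R1 at 4 bits needs ~335 GB of weights,
# which is why even aggressive sub-4-bit quants still land north of 100 GB:
full = model_mem_gb(671e9, 4)    # -> 335.5
```

That gap — tens of GB versus hundreds — is exactly the difference the distill naming papers over.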
jokowueu|1 year ago
tamlin|1 year ago
RandomBK|1 year ago
bestouff|1 year ago
darthrupert|1 year ago