(no title)
tamlin | 1 year ago
GPUs are an evolving target. New GPUs have tensor cores and support all kinds of interesting numeric formats, older GPUs don't support any of the formats that AI workloads are using today (e.g. BF16, int4, all the various smaller FP types).
NPU will be more efficient because it is much less general an GPU and doesn't have any gates for graphics. However, it is also fairly restricted. Cloud hardware is orders of magnitude faster (due to much higher compute resources I/O bandwidth), e.g. https://cloud.google.com/tpu/docs/v6e.
justincormack|1 year ago
tamlin|1 year ago
To perform a multiplication on CPU, even SIMD, that values have to fetched and converted to a form the CPU has multipliers for. This means smaller numeric types penalised. For a 128-bit memory bus, an NPU can fetch 32 4-bit values per transfer; the best case for a CPU is 16 8-bit values.
Details are scant on Microsoft's NPU, but it probably has many parallel multipliers; either in the form of tensor cores or a systolic array. The effective number of matmul's per second (or per memory operation) is higher.