darsnack|5 years ago
On the NVIDIA A100, standard FP32 performance is ~20 TFLOPS, but if you use the tensor cores and all the ML features available, it peaks out at 300+ TFLOPS. Not exactly your question, but a simple reference point.
Now the ML accelerator in the M1 (the Neural Engine) is only rated at 11 TOPS. So it's definitely not trying to compete as an accelerator for training.
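For anyone who wants to sanity-check those peak figures themselves, here's a rough sketch of how you'd measure effective matmul throughput in PyTorch (matrix size and iteration count are arbitrary; allow_tf32 is the switch that routes FP32 matmuls through Ampere's tensor cores):

    import time
    import torch

    def effective_tflops(n=8192, iters=50, dtype=torch.float32, allow_tf32=False):
        # TF32 routes FP32 matmuls through the tensor cores on Ampere GPUs;
        # FP16 matmuls use the tensor cores automatically.
        torch.backends.cuda.matmul.allow_tf32 = allow_tf32
        a = torch.randn(n, n, device="cuda", dtype=dtype)
        b = torch.randn(n, n, device="cuda", dtype=dtype)
        torch.cuda.synchronize()
        t0 = time.perf_counter()
        for _ in range(iters):
            a @ b
        torch.cuda.synchronize()
        # An n x n by n x n matmul costs ~2*n^3 FLOPs.
        return 2 * n ** 3 * iters / (time.perf_counter() - t0) / 1e12

    print("plain FP32:", effective_tflops(), "TFLOPS")
    print("TF32 cores:", effective_tflops(allow_tf32=True), "TFLOPS")
    print("FP16 cores:", effective_tflops(dtype=torch.float16), "TFLOPS")

On an A100 the three numbers should roughly track the 19.5 / 156 / 312 TFLOPS peaks, minus overhead; in practice you'll see somewhat less.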
qayxc|5 years ago
That depends entirely on the hardware of both the ML accelerator and the GPU in question, as well as the model's architecture, data, and size.
Unfortunately, Apple was very vague when they described the method behind the claimed "9x faster ML" performance.
They compared results on an "Action Classification Model" (what size? which data types? what dataset and batch size?) between an 8-core i7 and their M1 SoC. It isn't clear whether they're referring to training or inference, or whether it ran on the CPU or the SoC's iGPU, and no discrete GPU was mentioned anywhere either.
So until an independent 3rd-party review is available, your question cannot be answered. A 9x speedup with dedicated hardware over a thermally and power-constrained CPU is no surprise, though.
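To see why the training-vs-inference distinction alone can swing such a comparison, here's a minimal sketch (the model is a made-up toy, since Apple never published the spec of their "Action Classification Model"):

    import time
    import torch
    import torch.nn as nn

    # Toy stand-in; the real model, dataset and batch size are unknown.
    model = nn.Sequential(nn.Conv2d(3, 32, 3), nn.ReLU(),
                          nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                          nn.Linear(32, 10))
    x = torch.randn(32, 3, 224, 224)  # batch size matters and wasn't given

    def throughput(step, iters=20):
        t0 = time.perf_counter()
        for _ in range(iters):
            step()
        return iters / (time.perf_counter() - t0)

    # Inference: forward pass only, no gradients kept.
    with torch.no_grad():
        infer = throughput(lambda: model(x))

    # Training: forward + backward + weight update -- several times the work.
    opt = torch.optim.SGD(model.parameters(), lr=0.01)
    def train_step():
        opt.zero_grad()
        model(x).sum().backward()
        opt.step()
    train = throughput(train_step)

    print(f"inference: {infer:.1f} steps/s vs training: {train:.1f} steps/s")

A training step does a forward pass, a backward pass, and a weight update, so it's several times the work of an inference step; quoting a bare "9x" without saying which one was measured tells you very little.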
Even the notoriously weak previous-generation Intel SoCs could deliver up to a 7.73x improvement with certain models when using the iGPU [1]. As you can see in the source, some models don't benefit from GPU acceleration at all (at least as far as Intel's previous-gen SoCs are concerned).
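I don't know the exact harness Intel used for [1], but as a reference point, a CPU-vs-iGPU comparison with OpenVINO looks roughly like this (model.xml is a placeholder for whatever converted model you're testing, and the input shape is assumed):

    import time
    import numpy as np
    from openvino.runtime import Core  # pip install openvino

    core = Core()
    model = core.read_model("model.xml")  # hypothetical converted IR model
    x = np.random.rand(1, 3, 224, 224).astype(np.float32)  # assumed input shape

    for device in ("CPU", "GPU"):  # "GPU" here means the Intel iGPU
        request = core.compile_model(model, device).create_infer_request()
        request.infer([x])  # warm-up run, excluded from timing
        t0 = time.perf_counter()
        for _ in range(100):
            request.infer([x])
        print(device, 100 / (time.perf_counter() - t0), "inferences/s")

Whether the iGPU wins depends heavily on the model, which is exactly what the source shows.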
In the end, Apple's hardware isn't magic (even if they'd say otherwise ;)), and more power translates into higher performance, so their SoC will be inferior to high-power GPUs running compute shaders.
[1] https://software.intel.com/content/www/us/en/develop/article...