top | item 30435856


magic_at_enimai | 4 years ago

So with Tensor Cores you use TF32, which is really more like FP19-ish, and the marketing makes you think you get 8x the performance. But if you want actual FP32 precision you need something like [1], and then your performance on the Tensor Core path is _only_ ~2x faster than the SIMT path.

I'll leave the prefix sum for other devs who know more :D

https://github.com/NVIDIA/cutlass/blob/master/examples/27_am...

//part of nod.ai/shark team
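
To make the "FP19-ish" point concrete, here's a minimal Python sketch (not NVIDIA's implementation; accumulation here happens in Python doubles, whereas real Tensor Cores accumulate in FP32, and hardware rounds to nearest-even rather than the simple round-half-up used below) of TF32 rounding plus the 3xTF32-style split-product trick that FP32-emulation examples like [1] are built on:

```python
import math
import struct

def round_to_tf32(x: float) -> float:
    """Round an FP32 value to TF32 precision: keep the 8-bit exponent,
    cut the mantissa from 23 explicit bits down to 10."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits = (bits + (1 << 12)) & ~((1 << 13) - 1)  # drop the low 13 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]

def split_tf32(x: float):
    """Split an FP32 value into a TF32 'hi' part and a TF32 'lo' remainder."""
    hi = round_to_tf32(x)
    lo = round_to_tf32(x - hi)
    return hi, lo

def mul_3xtf32(a: float, b: float) -> float:
    """Emulate one FP32 multiply with three TF32 products (the tiny lo*lo
    term is dropped) -- which is why the emulated path costs ~3x a plain
    TF32 op and ends up only ~2x faster than the SIMT FP32 path."""
    a_hi, a_lo = split_tf32(a)
    b_hi, b_lo = split_tf32(b)
    return a_hi * b_hi + a_hi * b_lo + a_lo * b_hi

exact = math.pi * math.e
naive = round_to_tf32(math.pi) * round_to_tf32(math.e)
emulated = mul_3xtf32(math.pi, math.e)
print(f"single TF32 product error: {abs(naive - exact):.2e}")
print(f"3xTF32 product error:      {abs(emulated - exact):.2e}")
```

Running this shows the single TF32 product losing roughly three decimal digits relative to the emulated version, which is the whole precision/throughput tradeoff in one line.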


raphlinus | 4 years ago

I think we're talking past each other to some extent. Putting aside the question of how misleading it is to market a 16-bit multiply as a "TF32" operation, this is all about tradeoffs. The specific tradeoff that these tensor cores make is that in exchange for reduced precision (and a programming model which is even more of a pain than ordinary compute shaders, an astonishing achievement in and of itself), you get a lot more throughput. For certain AI workloads, particularly inference, that tradeoff is well worth it.

Reading between the lines a little, it sounds like your infrastructure is potentially able to exploit a good deal of the available throughput for FP32 workloads. That's great, and I'm happy to see it! However, for workloads that don't need that much precision, the tradeoff might be a lot less advantageous to M1. That may change again if and when Apple opens up lower-level APIs to their hardware, or reverse engineering delivers usable results.

shaklee3 | 4 years ago

TF32 and FP16 tensor cores are completely different, and TF32 is not 16-bit multiplication.
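
The bit layouts make the distinction concrete. A quick sketch (standard exponent/mantissa widths; the sign bit is counted separately):

```python
# (exponent bits, explicit mantissa bits) per format, sign bit not included
formats = {
    "fp16": (5, 10),
    "bf16": (8, 7),
    "tf32": (8, 10),  # 1 + 8 + 10 = 19 bits total -- hence "FP19-ish"
    "fp32": (8, 23),
}
for name, (e, m) in formats.items():
    # 2**(e-1) - 1 is the usual IEEE-style exponent bias / max exponent
    print(f"{name}: exponent range ~2^±{2**(e - 1) - 1}, precision ~2^-{m + 1}")
```

So TF32 shares FP16's 10-bit mantissa (similar per-element precision) but FP32's 8-bit exponent (similar dynamic range), which is why it's neither a 16-bit multiply nor full FP32.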