magic_at_enimai | 4 years ago
I'll leave the prefix sum for other devs who know more :D
https://github.com/NVIDIA/cutlass/blob/master/examples/27_am...
//part of nod.ai/shark team
raphlinus | 4 years ago
Reading between the lines a little, it sounds like your infrastructure is potentially able to exploit a good deal of the available throughput for FP32 workloads. That's great, and I'm happy to see it! However, for workloads that don't need that much precision, the tradeoff might be a lot less advantageous to M1. That may change again if and when Apple opens up lower-level APIs to their hardware, or reverse engineering delivers usable results.
shaklee3 | 4 years ago