this was a huge inspiration for the post! i tried to highlight it in the blog but it might have gotten buried
That's a nice tutorial, but just to be clear: it is not a deep dive in any sense. It's just the bog-standard tricks. It doesn't cover MMA and WMMA, which today are table stakes for fast matmul. It also doesn't cover software pipelining. It's basically a good summary of the basics.
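For readers unfamiliar with software pipelining: the idea is to overlap fetching the next tile of data with computing on the current one, usually via double buffering. A rough serial Python sketch of the loop structure (all names are illustrative, not from any real GPU API; on real hardware the load of tile k+1 would actually overlap with the math on tile k):

```python
# Double-buffered ("software pipelined") tile loop, sketched serially.
# On a GPU, the load issued for tile k+1 proceeds while the ALUs work
# on tile k; here we only show the loop structure.

TILE = 4

def load_tile(vec, k):
    """Stand-in for an async copy from global to shared memory."""
    return vec[k * TILE:(k + 1) * TILE]

def dot_pipelined(a, b):
    assert len(a) == len(b) and len(a) % TILE == 0
    n_tiles = len(a) // TILE
    # Prologue: fetch the first tiles before the main loop starts.
    cur_a, cur_b = load_tile(a, 0), load_tile(b, 0)
    acc = 0.0
    for k in range(n_tiles):
        # Issue the *next* load first, so (on real hardware) it can
        # run while we do the math on the current tiles.
        if k + 1 < n_tiles:
            nxt_a, nxt_b = load_tile(a, k + 1), load_tile(b, k + 1)
        # Compute on the current tiles.
        acc += sum(x * y for x, y in zip(cur_a, cur_b))
        # Swap buffers for the next iteration.
        if k + 1 < n_tiles:
            cur_a, cur_b = nxt_a, nxt_b
    return acc
```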
Can you explain why you did the naive algorithm here and not any of the fast matrix multiplication ones that trade multiplications for more additions? Just for educational purposes or is there a performance benefit in the technique?
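For reference, the "fast" algorithms being asked about here (e.g. Strassen's) recursively replace the 8 block multiplications of a 2×2 block product with 7, at the cost of extra additions. A minimal sketch on scalar 2×2 matrices (the recursive block version follows the same formulas):

```python
# Strassen's trick on a single 2x2 multiply: 7 multiplications instead
# of 8, paid for with 18 additions/subtractions. Applied recursively to
# blocks this gives an O(n^2.81) matmul, but the extra additions and
# worse locality usually make it a loss on GPUs at practical sizes.

def strassen_2x2(A, B):
    (a, b), (c, d) = A
    (e, f), (g, h) = B
    m1 = (a + d) * (e + h)
    m2 = (c + d) * e
    m3 = a * (f - h)
    m4 = d * (g - e)
    m5 = (a + b) * h
    m6 = (c - a) * (e + f)
    m7 = (b - d) * (g + h)
    return [[m1 + m4 - m5 + m7, m3 + m5],
            [m2 + m4, m1 - m2 + m3 + m6]]
```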
at least on my m2, the compiled kernel ends up using fast math anyways so using WGSL's fma didn't change anything about the actual kernel that gets run
To clarify the title: TFLOP/s is the unit the author goes after, not TFLOP. People in this thread compare CUDA performance on GPUs to WebGPU performance: recall that an H100 has a theoretical peak of about 1000 TFLOP/s for bfloat16, and even moderately complicated algorithms in typical modern transformer architectures can reach about half of that.
would be fun to do a leaderboard of some specific size (e.g. 4096x4096x4096) just to get all the code and tricks in one spot for folks to learn about things
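For anyone computing entries for such a leaderboard: a square matmul of size N does 2·N³ floating-point operations (one multiply and one add per term), so achieved TFLOP/s is just that count divided by the runtime. A quick sketch:

```python
def matmul_tflops(n, seconds):
    """Achieved TFLOP/s for an n x n x n matmul that took `seconds`.
    A dense matmul does 2*n^3 FLOPs: n^3 multiplies and n^3 adds."""
    return (2 * n ** 3) / seconds / 1e12

# A 4096^3 matmul is ~0.137 TFLOP of work, so e.g. a 10 ms run
# corresponds to roughly 13.7 TFLOP/s of achieved throughput.
```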
WebGPU doesn't seem to say anything about bank conflicts, hiding some hardware details that might be necessary to write the best kernel. will it be able to match the perf of CUDA on the same hardware?
WebGPU cannot even come close unfortunately, since it doesn't support hardware-specific memory features or warp-level primitives (like TMA or tensor cores). it's not like it gets 80% of the perf; it gets < 30% of peak for anything involving compute-heavy matrix multiplications
great question. to me webGPU sits a hair higher level than CUDA or Vulkan, so you don't have the exact same level of control, but you can get to ~80% of the performance without having to write different kernels specific to the hardware
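To make the bank-conflict point concrete: on NVIDIA hardware, shared memory is divided into 32 four-byte banks, and a warp serializes when several lanes hit different addresses in the same bank. A toy model of why padding a shared-memory tile by one column helps (the bank count and layout here are CUDA's; WebGPU exposes none of this):

```python
# Toy model of shared-memory bank conflicts. In CUDA, shared memory has
# 32 four-byte banks and 4-byte word i lives in bank i % 32. If all 32
# threads of a warp read column j of a row-major 32-wide float tile,
# every access lands in the same bank (a 32-way conflict). Padding each
# row by one float staggers the banks and removes the conflict.

BANKS = 32

def max_conflict(row_stride, column):
    """Worst-case number of threads hitting one bank when 32 threads
    read tile[t][column] from a row-major tile with the given stride."""
    banks = [(t * row_stride + column) % BANKS for t in range(32)]
    return max(banks.count(b) for b in set(banks))

# row_stride 32: all 32 threads hit the same bank -> 32-way conflict.
# row_stride 33 (one float of padding per row): conflict-free.
```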
shihab | 1 year ago
For context: this WebGPU version achieves ~17% of peak theoretical performance of M2. With CUDA (i.e. CuBLAS), you can reach ~75% of peak performance for same matrix config (without tensor core).
Const-me | 1 year ago
Not on the same computer; CUDA doesn’t run on the integrated GPU of the Apple M2 Pro.
brrrrrm | 1 year ago
weinzierl | 1 year ago
zanussbaum | 1 year ago
unknown | 1 year ago
[deleted]
mkeeter | 1 year ago
How to Optimize a CUDA Matmul Kernel for cuBLAS-like Performance (https://siboehm.com/articles/22/CUDA-MMM)
(It's CUDA-specific, so there may be aspects that can't yet be ported to WGPU)
zanussbaum | 1 year ago
there are a few things that i wasn't able to figure out how to get access to/i wasn't sure if they were possible. for example, a lot of Simon's article takes advantage of the warp scheduler and warp tiling.
i had a hard time finding information on if that's even possible with my M2/metal and the general memory access patterns. it seems like CUDA does have better documentation in this regard
almostgotcaught | 1 year ago
inglor | 1 year ago
saagarjha | 1 year ago
zanussbaum | 1 year ago
unknown | 1 year ago
[deleted]
pama | 1 year ago
unknown | 1 year ago
[deleted]
saagarjha | 1 year ago
FL33TW00D | 1 year ago
Also does quantized matmuls.
brrrrrm | 1 year ago
coffeeaddict1 | 1 year ago
Const-me | 1 year ago
As you can see, I implemented 32×32 tiling, using thread groups of 32×8 threads and two groupshared buffers to load tiles of the input matrices, and I accumulate into local variables: 32 / 8 = 4 accumulators per thread.
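A serial Python sketch of that scheme: 32×32 output tiles, a 32×8 grid of "threads" per tile, each thread owning 32/8 = 4 accumulators, with the two groupshared staging buffers modeled as plain lists. Function and variable names are illustrative; a real kernel would run the (x, y) loops as parallel threads.

```python
# Serial model of the tiling scheme described above. The output is cut
# into 32x32 tiles; each tile is computed by a 32x8 thread group where
# thread (x, y) accumulates 4 output elements (4 rows of column x).

TILE, TG_X, TG_Y = 32, 32, 8
ROWS_PER_THREAD = TILE // TG_Y  # = 4 accumulators per thread

def matmul_tiled(A, B, n):
    """n x n matmul, n assumed to be a multiple of TILE."""
    C = [[0.0] * n for _ in range(n)]
    for ti in range(0, n, TILE):
        for tj in range(0, n, TILE):
            for tk in range(0, n, TILE):
                # Stage one tile of each input (the two "groupshared"
                # buffers; cooperatively loaded on a real GPU).
                a_tile = [[A[ti + i][tk + k] for k in range(TILE)]
                          for i in range(TILE)]
                b_tile = [[B[tk + k][tj + j] for j in range(TILE)]
                          for k in range(TILE)]
                # Each (x, y) "thread" updates its 4 accumulators.
                for x in range(TG_X):
                    for y in range(TG_Y):
                        for r in range(ROWS_PER_THREAD):
                            i = y * ROWS_PER_THREAD + r
                            acc = C[ti + i][tj + x]
                            for k in range(TILE):
                                acc += a_tile[i][k] * b_tile[k][x]
                            C[ti + i][tj + x] = acc
    return C
```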
lostmsu | 1 year ago
unknown | 1 year ago
[deleted]
billconan | 1 year ago
brrrrrm | 1 year ago
zanussbaum | 1 year ago
jsbsjwbw | 1 year ago
[deleted]
maelito | 1 year ago
The smoothness of an iPhone map zoom, on any device.
jsheard | 1 year ago
Any device except an iPhone, until Apple finally gets around to shipping WebGPU in Safari. Any year now...