this was a huge inspiration for the post! i tried to highlight it in the blog but it might have gotten buried
That's a nice tutorial, but just to be clear: it is not a deep dive in any sense. It's just the bog-standard tricks. It doesn't cover MMA and WMMA, which today are table stakes for fast matmul. It also doesn't cover software pipelining. It's basically a good summary of the basics.
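For readers unfamiliar with software pipelining: the idea is to overlap fetching the next tile of data with computing on the current one, usually via double buffering. A rough serial Python sketch of the loop structure (all names are illustrative, not from any real GPU API; on real hardware the load of tile k+1 would actually overlap with the math on tile k):

```python
# Double-buffered ("software pipelined") tile loop, sketched serially.
# On a GPU, the load issued for tile k+1 proceeds while the ALUs work
# on tile k; here we only show the loop structure.

TILE = 4

def load_tile(vec, k):
    """Stand-in for an async copy from global to shared memory."""
    return vec[k * TILE:(k + 1) * TILE]

def dot_pipelined(a, b):
    assert len(a) == len(b) and len(a) % TILE == 0
    n_tiles = len(a) // TILE
    # Prologue: fetch the first tiles before the main loop starts.
    cur_a, cur_b = load_tile(a, 0), load_tile(b, 0)
    acc = 0.0
    for k in range(n_tiles):
        # Issue the *next* load first, so (on real hardware) it can
        # run while we do the math on the current tiles.
        if k + 1 < n_tiles:
            nxt_a, nxt_b = load_tile(a, k + 1), load_tile(b, k + 1)
        # Compute on the current tiles.
        acc += sum(x * y for x, y in zip(cur_a, cur_b))
        # Swap buffers for the next iteration.
        if k + 1 < n_tiles:
            cur_a, cur_b = nxt_a, nxt_b
    return acc
```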
Can you explain why you did the naive algorithm here and not any of the fast matrix multiplication ones that trade multiplications for more additions? Just for educational purposes or is there a performance benefit in the technique?
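For reference, the "fast" algorithms being asked about here (e.g. Strassen's) recursively replace the 8 block multiplications of a 2×2 block product with 7, at the cost of extra additions. A minimal sketch on scalar 2×2 matrices (the recursive block version follows the same formulas):

```python
# Strassen's trick on a single 2x2 multiply: 7 multiplications instead
# of 8, paid for with 18 additions/subtractions. Applied recursively to
# blocks this gives an O(n^2.81) matmul, but the extra additions and
# worse locality usually make it a loss on GPUs at practical sizes.

def strassen_2x2(A, B):
    (a, b), (c, d) = A
    (e, f), (g, h) = B
    m1 = (a + d) * (e + h)
    m2 = (c + d) * e
    m3 = a * (f - h)
    m4 = d * (g - e)
    m5 = (a + b) * h
    m6 = (c - a) * (e + f)
    m7 = (b - d) * (g + h)
    return [[m1 + m4 - m5 + m7, m3 + m5],
            [m2 + m4, m1 - m2 + m3 + m6]]
```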
at least on my m2, the compiled kernel ends up using fast math anyways so using WGSL's fma didn't change anything about the actual kernel that gets run
To clarify the title: TFLOP/s is the unit the author goes after, not TFLOP. People in this thread compare CUDA performance on GPUs to WebGPU performance: recall that an H100 has a theoretical peak of about 1000 TFLOP/s for bfloat16, and even moderately complicated algorithms in typical modern transformer architectures can reach about half of that.
would be fun to do a leaderboard of some specific size (e.g. 4096x4096x4096) just to get all the code and tricks in one spot for folks to learn about things
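For anyone computing entries for such a leaderboard: a square matmul of size N does 2·N³ floating-point operations (one multiply and one add per term), so achieved TFLOP/s is just that count divided by the runtime. A quick sketch:

```python
def matmul_tflops(n, seconds):
    """Achieved TFLOP/s for an n x n x n matmul that took `seconds`.
    A dense matmul does 2*n^3 FLOPs: n^3 multiplies and n^3 adds."""
    return (2 * n ** 3) / seconds / 1e12

# A 4096^3 matmul is ~0.137 TFLOP of work, so e.g. a 10 ms run
# corresponds to roughly 13.7 TFLOP/s of achieved throughput.
```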
WebGPU doesn't seem to say anything about bank conflicts, hiding some hardware details that might be necessary to write the best kernel. will it be able to match the perf of CUDA on the same hardware?
WebGPU cannot even come close unfortunately, since it doesn't support hardware-specific memory features or warp-level primitives (like TMA or tensor cores). it's not like it gets 80% of the perf; it gets < 30% of peak for anything involving compute-heavy matrix multiplications
great question. to me webGPU sits a hair higher level than CUDA or Vulkan, so you don't have the exact same level of control, but you can get to ~80% of the performance without having to write different kernels specific to the hardware
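To make the bank-conflict point concrete: on NVIDIA hardware, shared memory is divided into 32 four-byte banks, and a warp serializes when several lanes hit different addresses in the same bank. A toy model of why padding a shared-memory tile by one column helps (the bank count and layout here are CUDA's; WebGPU exposes none of this):

```python
# Toy model of shared-memory bank conflicts. In CUDA, shared memory has
# 32 four-byte banks and 4-byte word i lives in bank i % 32. If all 32
# threads of a warp read column j of a row-major 32-wide float tile,
# every access lands in the same bank (a 32-way conflict). Padding each
# row by one float staggers the banks and removes the conflict.

BANKS = 32

def max_conflict(row_stride, column):
    """Worst-case number of threads hitting one bank when 32 threads
    read tile[t][column] from a row-major tile with the given stride."""
    banks = [(t * row_stride + column) % BANKS for t in range(32)]
    return max(banks.count(b) for b in set(banks))

# row_stride 32: all 32 threads hit the same bank -> 32-way conflict.
# row_stride 33 (one float of padding per row): conflict-free.
```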
shihab | 1 year ago
For context: this WebGPU version achieves ~17% of peak theoretical performance of M2. With CUDA (i.e. CuBLAS), you can reach ~75% of peak performance for same matrix config (without tensor core).
Const-me | 1 year ago
Not on the same computer; CUDA doesn’t run on the integrated GPU of the Apple M2 Pro.
brrrrrm | 1 year ago
weinzierl | 1 year ago
zanussbaum | 1 year ago
unknown | 1 year ago
[deleted]
mkeeter | 1 year ago
How to Optimize a CUDA Matmul Kernel for cuBLAS-like Performance (https://siboehm.com/articles/22/CUDA-MMM)
(It's CUDA-specific, so there may be aspects that can't yet be ported to WGPU)
zanussbaum | 1 year ago
there are a few things that i wasn't able to figure out how to get access to/i wasn't sure if they were possible. for example, a lot of Simon's article takes advantage of the warp scheduler and warp tiling.
i had a hard time finding information on if that's even possible with my M2/metal and the general memory access patterns. it seems like CUDA does have better documentation in this regard
almostgotcaught | 1 year ago
inglor | 1 year ago
saagarjha | 1 year ago
zanussbaum | 1 year ago
unknown | 1 year ago
[deleted]
pama | 1 year ago
unknown | 1 year ago
[deleted]
saagarjha | 1 year ago
FL33TW00D | 1 year ago
Also does quantized matmuls.
brrrrrm | 1 year ago
coffeeaddict1 | 1 year ago
Const-me | 1 year ago
As you can see, I implemented 32×32 tiling, using thread groups of 32×8 threads and two groupshared buffers to load tiles of the input matrices, and I accumulate into local variables: 32 / 8 = 4 accumulators per thread.
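A serial Python sketch of that scheme: 32×32 output tiles, a 32×8 grid of "threads" per tile, each thread owning 32/8 = 4 accumulators, with the two groupshared staging buffers modeled as plain lists. Function and variable names are illustrative; a real kernel would run the (x, y) loops as parallel threads.

```python
# Serial model of the tiling scheme described above. The output is cut
# into 32x32 tiles; each tile is computed by a 32x8 thread group where
# thread (x, y) accumulates 4 output elements (4 rows of column x).

TILE, TG_X, TG_Y = 32, 32, 8
ROWS_PER_THREAD = TILE // TG_Y  # = 4 accumulators per thread

def matmul_tiled(A, B, n):
    """n x n matmul, n assumed to be a multiple of TILE."""
    C = [[0.0] * n for _ in range(n)]
    for ti in range(0, n, TILE):
        for tj in range(0, n, TILE):
            for tk in range(0, n, TILE):
                # Stage one tile of each input (the two "groupshared"
                # buffers; cooperatively loaded on a real GPU).
                a_tile = [[A[ti + i][tk + k] for k in range(TILE)]
                          for i in range(TILE)]
                b_tile = [[B[tk + k][tj + j] for j in range(TILE)]
                          for k in range(TILE)]
                # Each (x, y) "thread" updates its 4 accumulators.
                for x in range(TG_X):
                    for y in range(TG_Y):
                        for r in range(ROWS_PER_THREAD):
                            i = y * ROWS_PER_THREAD + r
                            acc = C[ti + i][tj + x]
                            for k in range(TILE):
                                acc += a_tile[i][k] * b_tile[k][x]
                            C[ti + i][tj + x] = acc
    return C
```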
lostmsu | 1 year ago
unknown | 1 year ago
[deleted]
billconan | 1 year ago
brrrrrm | 1 year ago
zanussbaum | 1 year ago
jsbsjwbw | 1 year ago
[deleted]
maelito | 1 year ago
The smoothness of an iPhone map zoom, on any device.
jsheard | 1 year ago
Any device except an iPhone, until Apple finally gets around to shipping WebGPU in Safari. Any year now...