CUDA-L2: Surpassing cuBLAS performance for matrix multiplication through RL

132 points | dzign | 3 months ago | github.com

15 comments

[+] j2kun|3 months ago|reply
They claim the algorithm "discovered" the new techniques, but the methods described in section 5 do not seem all that novel to me. It smells like it could be "laundering" the literature [1] and reshuffling existing techniques. This is not inherently a bad thing, but I would hope that if it is borrowing existing techniques, the appropriate citation would eventually make it into this paper.

[1]: https://www.argmin.net/p/lore-laundering-machines

[+] Q6T46nT668w6i3m|3 months ago|reply
You’re not kidding. I just looked. There isn’t anything novel in that section. I assumed from the description that they had found novel methods, but this is standard GPU Gems advice.
[+] AlexCoventry|3 months ago|reply
In the future, we will all be Jürgen Schmidhuber. :-)
[+] alyxya|3 months ago|reply
There generally aren't new techniques when optimizing something ubiquitous. Instead, there are a lot of ways to apply existing techniques to create new and better results. Most ideas are built on top of the same foundational principles.
[+] alyxya|3 months ago|reply
The chart confused me because I expected to see raw performance numbers for CUDA-L2 alongside the others, but instead it shows the speedup percentage of CUDA-L2 over each baseline. In some sense the bar chart inverts the performance of torch.matmul and cuBLAS: the slower the baseline, the taller its bar. A 0% bar would just mean equal performance.
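(The inversion described above can be made concrete with a toy calculation; the timings below are made up purely for illustration, not taken from the paper:)

```python
# Hypothetical timings in seconds for one GEMM shape -- illustrative only.
t_cublas = 1.00
t_cuda_l2 = 0.80

# The chart plots speedup percent of CUDA-L2 over the baseline:
speedup_pct = (t_cublas / t_cuda_l2 - 1) * 100
print(f"{speedup_pct:.0f}%")  # 25% -- a 0% bar means equal performance

# Note the inversion: a *slower* baseline produces a *taller* bar,
# so the bars rank the baselines in reverse order of their raw speed.
```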
[+] stonogo|3 months ago|reply
Am I reading this wrong, or does this only support FP16 inputs, and compares its performance against an FP32 solver?
[+] Bulat_Ziganshin|3 months ago|reply
They compare HGEMM implementations. At least CUBLAS has HGEMM functions.

HGEMM means half-precision (i.e. FP16) general matrix multiplication
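(A minimal sketch of what "half-precision GEMM" means, using NumPy as a stand-in for a cuBLAS HGEMM call; real HGEMM kernels often accumulate in FP32 internally, which this sketch does not model:)

```python
import numpy as np

# HGEMM-style shapes: C (M x N) = A (M x K) @ B (K x N), all FP16.
M, N, K = 4, 4, 4
A = np.ones((M, K), dtype=np.float16)
B = np.ones((K, N), dtype=np.float16)

C = A @ B  # FP16 in, FP16 out
print(C.dtype, C[0, 0])  # float16 4.0
```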

[+] roflmaostc|3 months ago|reply
> Q: What if I need matrix dimensions (M, N, K) not found in your configurations?
> A: 1. You can find the nearest neighbor configuration (larger than yours) and pad with zeros. 2. Feel free to post your dimensions on GitHub issues. We are happy to release kernels for your configuration.

Lol, this could easily end up much slower than just using a general matmul kernel.
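(A sketch of the FAQ's zero-padding workaround and why it can hurt; the supported sizes below are hypothetical, not the repo's actual configuration list:)

```python
import numpy as np

# Hypothetical list of dimensions the pre-tuned kernels support.
SUPPORTED = [1024, 2048, 4096]

def pad_to_supported(A):
    """Pad a matrix up to the nearest supported (larger) dimensions."""
    m, n = A.shape
    M = min(s for s in SUPPORTED if s >= m)
    N = min(s for s in SUPPORTED if s >= n)
    out = np.zeros((M, N), dtype=A.dtype)
    out[:m, :n] = A
    return out

A = pad_to_supported(np.ones((1000, 1500)))
print(A.shape)  # (1024, 2048)

# Wasted work for this operand: (1024*2048) / (1000*1500) ~ 1.40x the
# elements, so padding can easily lose more than the tuned kernel gains.
```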

However, I like this kind of research because it really exploits specific hardware configurations and makes things measurably faster (unlike some theoretical matmul improvements). Code specialization is cheap, and if it saves on the order of a few percent, it quickly pays for itself, especially for something as important as matmul.

[+] konradha|3 months ago|reply
I've been trying my hand at RL envs for various sparse matrix algorithms in CUDA. It's easy to generate code that "looks good", "novel" and "fast". Escaping the distribution and actually creating novel sequences of instructions, or even patterns (has any model come up with something as useful as fan-in/fan-out or the double-buffering pattern that's now ubiquitous?), seems difficult, to say the least.
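(For readers unfamiliar with the double-buffering pattern mentioned above, here is a schematic pure-Python sketch: overlap loading tile i+1 while computing on tile i. In a real CUDA kernel the load would be an async global-to-shared copy and the swap a pipeline stage, not Python lists:)

```python
def load(tile):
    return [x * 2 for x in tile]   # stand-in for a global->shared copy

def compute(buf):
    return sum(buf)                # stand-in for the per-tile MMA work

tiles = [[1, 2], [3, 4], [5, 6]]
buf_a = load(tiles[0])             # prologue: fill the first buffer
total = 0
for i in range(len(tiles)):
    if i + 1 < len(tiles):
        buf_b = load(tiles[i + 1])   # prefetch next tile into the other buffer
    total += compute(buf_a)          # compute on the current buffer
    if i + 1 < len(tiles):
        buf_a, buf_b = buf_b, buf_a  # swap buffers for the next iteration

print(total)  # 42 (= 2 * (1+2+3+4+5+6))
```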
[+] bgwalter|3 months ago|reply

[deleted]

[+] krapht|3 months ago|reply
This is a standard which few kernels will ever meet. I'd say requiring a numerical proof is the same as requiring no proof at all - because it won't ever happen unless you're validating silicon or something equally expensive.