Discovering faster matrix multiplication algorithms with reinforcement learning

[+] svnt|3 years ago|reply

The big claim turns out to be a little overstated. The claim:

> AlphaTensor’s algorithm improves on Strassen’s two-level algorithm for the first time, to our knowledge, since its discovery 50 years ago.

reduces to:

> AlphaTensor discovers algorithms that outperform the Strassen-square algorithm, which is a fast algorithm for large square matrices31,32. Although the discovered algorithm has the same theoretical complexity as Strassen-square, it outperforms it in practice, as it is optimized for the considered hardware. Interestingly, AlphaTensor finds algorithms with a larger number of additions compared with Strassen-square (or equivalently, denser decompositions), but the discovered algorithms generate individual operations that can be efficiently fused by the specific XLA33 grouping procedure and thus are more tailored towards the compiler stack we use. The algorithms found by AlphaTensor also provide gains on matrix sizes larger than what they were optimized for. Finally, Fig. 5c shows the importance of tailoring to particular hardware, as algorithms optimized for one hardware do not perform as well on other hardware.

So they improved on Strassen’s 50-year-old algorithm by optimizing it for hardware that has only existed for a few years and de-optimizing it for human use? There is so much cool stuff here without claiming to do something you really didn’t.

[+] boole1854|3 years ago|reply

This is not a correct summary of the results of the paper.

First, you cut out the initial part of the sentence about improving on Strassen's two-level algorithm. Here is the complete sentence:

> Particularly relevant is the case of 4 × 4 matrices in a finite field, where AlphaTensor’s algorithm improves on Strassen’s two-level algorithm for the first time, to our knowledge, since its discovery 50 years ago.

That is, note that the improvement cited in the latter part of the sentence is for the 4 x 4 matrix case.

But then your next quote is not in reference to 4 x 4 matrices. That is, for larger matrix sizes, the best algorithm AlphaTensor discovered had the same theoretical complexity as Strassen-square, but better performance.

EDIT: my next comments are confused and best ignored. See reply from pxx below.

For the 4 x 4 matrix case, the theoretical complexity of AlphaTensor's algorithm was O(N^2.778) compared with the Strassen algorithm's complexity of O(N^2.8074), which is an improvement.

[+] chabons|3 years ago|reply

Maybe I'm mis-reading the paper, but my interpretation was that the algorithm discovered using a reward which scales with HW performance matches the theoretical complexity of Strassen, but is more performant on the HW.

They also identified algorithms which do fewer matrix multiplications than Strassen, improving the lower bound of matrix multiplies required (They highlight this in Fig 3).

In that light, I thought their claim was fair. They've discovered (different) algorithms which are both theoretically and practically better than Strassen's.

[+] dekhn|3 years ago|reply

That's the DeepMind modus operandi- and why they publish in Nature. It's in their best interest to claim as much covered ground as possible so they can keep working towards their end-goal. But most people who read Nature already know that results in that journal are overhyped. You just have to apply a specific prior/translator when you read their papers.

[+] atty|3 years ago|reply

On the other hand, real world performance is really the only useful metric for matrix multiplication. I don’t really care about the theoretical performance, or the number of operations. Not disagreeing with your take that the claim is grandiose, just pointing out that finding a generalizable way to automate the improvement of what is almost certainly the most important mathematical operation a computer can do is worth some attention. It also suggests other mathematical operations could be improved this way as well - potentially algorithms that haven’t gotten as much theoretical or practical attention as matrix multiplication. With all that said, they even point out that other people have worked on this before using other optimization strategies. I’d guess they got into Nature either by using the Deepmind name, because Nature really loves reinforcement learning because it’s a cool topic that draws views, or because they’re the first group to find their algorithm leads to real improvements. (Probably a mixture of all three, I’d guess)

[+] throwawaymaths|3 years ago|reply

No, they count multiplications, 47 vs 49 in the 4x4 case. This assumes that multiplications are the heavy lift relative to additions, memory moves, etc. this is not a bad assumption, and is hardware independent, although there are some proposed 'exotic' hardware solutions where addition and multiplication are equally difficult (via lookup table, limited to about 8-bit x 8-bit) or where addition is harder than multiplication (log space number systems) but I don't think these have been deployed

[+] letitgo12345|3 years ago|reply

I think they find a N^2.778 complexity algorithm (obtained by applying their 4x4 discovered algorithm recursively) so this is not correct.

[+] Ar-Curunir|3 years ago|reply

You’re wrong. They discovered an algorithm with fewer multiplications for 4x4 matrices.

They also discovered better algorithms for other dimensions.

[+] ColinWright|3 years ago|reply

How can I reconcile your comment here with the comment in https://news.ycombinator.com/item?id=33098192 ??

[+] johndfsgdgdfg|3 years ago|reply

What do you expect from an ad company? Not the truth.

[+] xpe|3 years ago|reply

The paper is a great read -- an interesting approach, fun theoretical comparisons, plus benchmarks for optimized algorithms for specific hardware (CPU and GPU, allowing for differing instruction sets).

But it doesn't stop there; from the "Discussion" section:

> One important strength of AlphaTensor is its flexibility to support complex stochastic and non-differentiable rewards (from the tensor rank to practical efficiency on specific hardware), in addition to finding algorithms for custom operations in a wide variety of spaces (such as finite fields). We believe this will spur applications of AlphaTensor towards designing algorithms that optimize metrics that we did not consider here, such as numerical stability or energy usage.

[+] mikewarot|3 years ago|reply

I have a gut feeling that there is a faster way to compute logarithms going from least to most significant bit. How would I go about using ML to find it?

[Edit] I think Feynman's algorithm might do it:

"Consider the problem of finding the logarithm of a fractional number between 1 and 2. (The algorithm can be generalized without too much difficulty.) Feynman observed that any such number can be uniquely represented as a product of numbers of the form 1 + 2^(-k), where k is an integer. Testing for the presence of each of these factors in a binary representation is simply a matter of a shift and a subtraction. Once the factors are determined, the logarithm can be computed by adding together the precomputed logarithms of the factors. The algorithm fit the Connection Machine especially well because the small table of the logarithms of 1 + 2^(-k) could be shared by all the processors. The entire computation took less time than doing a division."

[+] adgjlsfhk1|3 years ago|reply

note that this algorithm will no longer be good. for precisions up to roughly 1000 bits small tables combined with minimax polynomials are optimal and for higher precision, you want to use more complicated methods. if you're interested, the ARB library unlikely the currently fastest known methods.

[+] radford-neal|3 years ago|reply

Seems similar to the CORDIC algorithm (https://en.wikipedia.org/wiki/CORDIC).

[+] xpe|3 years ago|reply

Start with these questions:

1. How big is the search space?

2. What analysis approaches are likely to bear fruit for the search space? (theoretical analysis? optimization?)

3. If optimization is called for, what kind?

[+] chillee|3 years ago|reply

Quoting myself on twitter:

https://twitter.com/cHHillee/status/1577713102434361344

I'm quite suspicious about their hardware benchmarks. They're not writing custom kernels, they're relying on a graph compiler like XLA to automatically fuse their decomposed matmuls (and my guess is that XLA will not be very good at this).

Moreover, as far as I can tell, they don't report absolute performance numbers anywhere. In other words, I suspect that a naive N^3 matrix multiplication would absolutely smoke them in performance.

[+] unknown|3 years ago|reply

[deleted]

[+] optimalsolver|3 years ago|reply

Now, were these algorithms discovered, or invented?

I.e., have they always been there, just sitting in Platonic space waiting for a conscious mind to stumble across them, or have they just now popped into existence?

[+] MichaelBurge|3 years ago|reply

The dovetail over all possible programs solves every problem with at most a constant slowdown.

So if it was invented, it was invented 100 years ago along with every other algorithm since the inventor of the dovetail incorporated it by reference. And there are no more algorithms to invent.

And if it was discovered, you would want to compare the efficiency of your discovery process with the dovetail.

So I tend to say "discovered with X bits of optimization power", where 0 bits reduces to the dovetail over some enumeration process, infinity bits reduces to "invention"(i.e. you consider one(or zero) object from the stream, constructing it directly), and everything in-between grades the search process.

[+] WFHRenaissance|3 years ago|reply

Discovered. I'd go as far to say that all "invention" is actually just discovery that we find useful.

[+] lawrenceyan|3 years ago|reply

> Leveraging this diversity, we adapted AlphaTensor to specifically find algorithms that are fast on a given hardware, such as Nvidia V100 GPU, and Google TPU v2. These algorithms multiply large matrices 10-20% faster than the commonly used algorithms on the same hardware, which showcases AlphaTensor’s flexibility in optimising arbitrary objectives.

10-20% performance improvement in matrix multiplications is pretty amazing[0]!

[0]: https://www.nature.com/articles/s41586-022-05172-4/figures/5

[+] gnufx|3 years ago|reply

That doesn't say exactly what was run and how. From measurements with nvblas, that difference could be dwarfed by the effect of tile size and pinning buffers or not.

[+] a-dub|3 years ago|reply

this is cool. i suppose it's only a matter of time until we see optimizing compilers that use transformer-rl style searches for subsets of their optimization and codegen.

[+] mhh__|3 years ago|reply

Generating pgo data with graph ML is already a thing

[+] Psyladine|3 years ago|reply

Nevermind compilers, the possibility of updating architectures & code bases by automation. A true feasible attack on tech debt and legacy code.

[+] danuker|3 years ago|reply

Will also generate some fun bugs!

[+] ColinWright|3 years ago|reply

Quoting:

> ... AlphaTensor finds an algorithm for multiplying 4×4 matrices using 47 multiplications in Z_2 , thereby outperforming Strassen’s two-level algorithm, which involves 7^2 = 49 multiplications. By applying this algorithm recursively, one obtains a practical matrix multiplication algorithm in Z_2 with complexity O(N^2.778).

> Moreover, AlphaTensor discovers efficient algorithms for multiplying matrices in standard arithmetic; for example, AlphaTensor finds a rank-76 decomposition of T_{4,5,5}, improving over the previous state-of-the-art complexity of 80 multiplications.

[+] HarHarVeryFunny|3 years ago|reply

Can someone explain the approach here?

I understand that they transform the problem into a gamified 3-D matrix decomposition, but what exactly is the motivation for using RL to beat this decomposition game ? Why not just use, for example, an evolution algorithm to find incrementally better decompositions ?

[+] gnufx|3 years ago|reply

There's commentary here which seems to be talking about matrix multiplication in general and ignoring real implementations, which are dominated by considerations of the memory hierarchy. They actually introduce operations, like packing for sufficiently large dimensions but not for smaller, and prefetch. Some of that seems only to be tunable empirically for the micro-architecture too. (I don't know how the considerations vary between CPU and GPU.) Fairly recent work on Strassen implementation, talks about that and the normal Goto-type algorithm: https://jianyuhuang.com/papers/sc16.pdf

[+] zekrioca|3 years ago|reply

Always caution someone who wants to follow in this research:

When somebody promotes a fast algorithm there is often a catch. In this case the issue is numerical stability. There is a theorem that states that any algorithm for n-by-n matrix-matrix multiplication that is componentwise forward stable (as good as it gets in this situation) much necessarily use n^3 scalar multiplications. The authors will therefore waste their time if they carry out their plans and try to optimize for stability. The standard algorithm has the nice property and no faster algorithm can have this property. The question of fast matrix multiplication was raised recently on mathoverflow.net, see https://mathoverflow.net/q/421304/110176 and the answers given there.

[+] WhitneyLand|3 years ago|reply

In general to speed up matrix operations how much is some sort of result caching applied?

For example if you profile calculations done when training a model, is there any significant repetition happening that would allow some kind of benefit from table lookups of certain solutions?

[+] ubik80|3 years ago|reply

Machine learning for finding matrix multiplication algorithms is not so new. The problem was solved before with other methods. For example:

https://www.researchgate.net/publication/341942316_Searching...

[+] didibus|3 years ago|reply

Can someone knowledgeable tell me if it discovered an algorithm we can understand and re-implement ourself, like is the pseudo-code for it known? Or is it kind of stuck in the infered function of the ML model?

[+] bugzz|3 years ago|reply

yes, they implemented the resulting algorithms and have charts showing real world performance on GPUs and TPUs

[+] puffoflogic|3 years ago|reply

Why the heck is a math paper in Nature? I'd put Nature somewhere just below vixra in terms of math credibility.

[+] 323|3 years ago|reply

If you could find a way to train cats to solve integrals, wouldn't that be a Nature paper and not a math one?

[+] bitwize|3 years ago|reply

It's begun. AI is improving itself.

[+] wbhart|3 years ago|reply

The new 4x4 matrix multiplication over F_2 has practical applications as many matrix operations over F_2 can be reduced to matrix multiplication.

For anyone looking for the algorithm itself, it is actually given in in one of the extended data sections at the very end of the paper.

[+] KKKKkkkk1|3 years ago|reply

Reinforcement learning is usually introduced as a technique to train an agent in simulation that would then be let loose in a real environment. Interesting that they choose to treat it is a search strategy.

[+] unknown|3 years ago|reply

[deleted]

[+] bjourne|3 years ago|reply

Never heard of the "matrix multiplication tensor" before. Is there some digestible blog post or article somewhere explaining what it is and why it is useful?

111 comments