
Matrix multiplication inches closer to mythic goal

310 points | yboris | 5 years ago | quantamagazine.org

217 comments

[+] thomasahle|5 years ago|reply
If you are interested in how those tensor methods work, the authors have another very nice paper that gives a great introduction https://arxiv.org/abs/1810.08671 It is actually not that different from Strassen after all. It just uses a bit more notation to help construct more clever ways of saving multiplication.
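For anyone who wants to see the trick concretely, here is a sketch of the classic Strassen 2x2 scheme in Python (numpy for brevity; this is the textbook version of the idea, not code from the paper): seven block products M1..M7 replace the eight the naive method needs.

```python
import numpy as np

def strassen_2x2_blocks(A11, A12, A21, A22, B11, B12, B21, B22):
    # Strassen's seven block products (the textbook method needs eight).
    M1 = (A11 + A22) @ (B11 + B22)
    M2 = (A21 + A22) @ B11
    M3 = A11 @ (B12 - B22)
    M4 = A22 @ (B21 - B11)
    M5 = (A11 + A12) @ B22
    M6 = (A21 - A11) @ (B11 + B12)
    M7 = (A12 - A22) @ (B21 + B22)
    # Reassemble the four blocks of the product from sums of the M's.
    C11 = M1 + M4 - M5 + M7
    C12 = M3 + M5
    C21 = M2 + M4
    C22 = M1 - M2 + M3 + M6
    return C11, C12, C21, C22
```

The blocks can themselves be matrices, which is what makes the construction recursive.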
[+] Const-me|5 years ago|reply
I’m more interested in why saving multiplications at the cost of more adds/subtracts helps overall.

On all modern CPUs, float multiplication and addition run at the same speed. Specifically, both have 4 cycles of latency on Skylake and 3 cycles of latency on Zen 2, and throughput (for 32-byte vectors) is 0.5 cycles on both CPUs.
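For what it's worth, the win comes from the recursion, not from per-operation cost: each 2x2 block step needs 7 multiplications instead of 8, and the saving compounds at every level, while the extra additions only cost O(n^2) per level and don't change the exponent. A quick sketch of the counting argument (illustrative Python):

```python
def mults(n, per_step):
    """Scalar multiplications for an n x n product (n a power of two),
    when each recursion level splits into `per_step` half-size products."""
    return 1 if n == 1 else per_step * mults(n // 2, per_step)

for n in (64, 1024):
    naive = mults(n, 8)      # textbook: 8 half-size products -> n^3 total
    strassen = mults(n, 7)   # Strassen: 7 half-size products -> ~n^2.807
    print(n, naive, strassen, strassen / naive)
```

At n = 1024, Strassen's count is already under 27% of the textbook count, even though every individual operation costs the same.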

[+] woopwoop|5 years ago|reply
On the off chance an expert reads this comment and feels like answering, is there a consensus opinion (or even just your opinion) on what the best result is likely to be? Do people think O(n^2) is actually possible?

Edit: shoulda read to the end, it says that Strassen says no.

[+] jessriedel|5 years ago|reply
The impression I had is that many experts believe that no single algorithm A can achieve O(n^2), but there exists a sequence of algorithms A_i that achieve O(n^{2+d_i}), with d_i>0 getting arbitrarily small, but with the algorithms getting arbitrarily long and the constant overhead factors out front getting arbitrarily large.

Unfortunately this was just word of mouth, and I might be misremembering.

[+] phkahler|5 years ago|reply
I'm no expert, but IMHO it feels like a problem that might have O(n^2 log(n)) time complexity. That would be more satisfying than some arbitrary non-integer exponent.
[+] pjsg|5 years ago|reply
There are related problems in computer science -- e.g. how quickly can you multiply a pair of 65536x65536 extended double precision matrices together? Note that each matrix consumes ~68GB of memory, so you need (probably) at least 200GB to store the two input matrices and the output matrix. So far, so good.

Now you want to see how fast it can really go if you are prepared to spend (say) $100k on the hardware (e.g. 64 times as many CPUs with some fancy interconnect fabric). How about spending $100M? It turns out that moving all this data around is also quite costly.....

[+] Zenst|5 years ago|reply
>65536x65536 extended double precision matrices together

For perspective, that would be akin to multiplying every MAC address with every other MAC address upon the entire IPv4 range.

[+] goldenkey|5 years ago|reply
That's not a problem in C.S. That's a problem in engineering.
[+] yrral|5 years ago|reply
Does anyone know how matrix multiplications are implemented in CPUs, GPUs, or ML ASICs? Are there optimizations used, or do the optimizations take too much circuitry, so the n^3 method is still used?
[+] taxemeEvasion|5 years ago|reply
It's typically still the O(N^3) method that's implemented in OpenBLAS, BLIS, cuBLAS, MKL, etc. There are some high performance implementations of Strassen's Algorithm with a fixed level of recursion that start to perform much better as the matrix size gets large (see the work from UT Austin's BLIS). From my understanding for the more complex algorithms the hidden constant simply grows too large to be worth it on conventional hardware.

As another poster commented, these are O(N^3) in work, not necessarily the wall clock time you see due to parallel speedup and cache effects, but will scale as O(N^3) once you're at a size past all of these. These implementations are highly tuned and optimized for hardware, much much more so than the naive 3 loop implementation. The predictable and 'easy' access patterns make the simple algorithm hard to beat.

[+] AnotherGoodName|5 years ago|reply
Note that GPUs are almost always multiplying 4x4 matrices. Sure, they multiply many matrices together, but each is 4x4. The 4^3 complexity isn't at all an issue (64 steps), and the constant factor for some of the n^2.x methods is high.
[+] OneDayIGoHome|5 years ago|reply
For GPUs, it's actually much faster than O(n^3) because computing each entry in the result matrix is independent. Hence, the problem is embarrassingly parallel in a way.

I don't know how to use O() notation for GPUs, but it should be something like O(n^2/k^2), where k is the tile size [0].

Also, the comparatively low memory bandwidth becomes a bottleneck here, so there are a lot of optimizations on how to transfer data from the CPU to the GPU and then, within the GPU, how to access the matrices efficiently.

[0]https://docs.nvidia.com/deeplearning/performance/dl-performa...

[+] balefrost|5 years ago|reply
> However, the number of additions required to add those sets of matrices is much closer together. Generally, the number of additions is equal to the number of entries in the matrix, so four for the two-by-two matrices and 16 for the four-by-four matrices.

Maybe my linear algebra is REALLY rusty, but that doesn't seem right to me.

For two 4x4 matrices, each element in the output matrix will be the sum of four products. So each of the 16 output elements will require 3 additions, and you'll need a total of 48 additions.

I'm not sure where they're getting "16 additions" from.

To multiply two nxn matrices according to the textbook method, I think you'll need n^3 multiplications and n^3-n^2 additions.

I don't doubt that the multiplication time dominates the algorithm. Maybe this is just a nitpick.
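Those counts are easy to sanity-check with a little illustrative counter for the three-loop method (Python; the function name is just made up for this sketch):

```python
def textbook_op_counts(n):
    """Count scalar multiplications and additions in the three-loop method."""
    mults = adds = 0
    for i in range(n):
        for j in range(n):
            # c[i][j] is a sum of n products: n mults, n - 1 adds.
            mults += n
            adds += n - 1
    return mults, adds

print(textbook_op_counts(4))  # (64, 48): n^3 mults, n^3 - n^2 adds
```

For n = 4 that gives the 48 additions computed above, matching the n^3 - n^2 formula, not 16.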

[+] soVeryTired|5 years ago|reply
They're counting the additions needed to add the matrices, not to multiply them. The point is that matrix addition is order n^2, so even if some intermediate calculation needs a large number of matrix additions, they don't hurt you as n gets large if you're trying to speed up matrix multiplication.
[+] ovi256|5 years ago|reply
> “Exponent two” refers to the ideal speed — in terms of number of steps required — of performing one of the most fundamental operations in math: matrix multiplication. If exponent two is achievable, then it’s possible to carry out matrix multiplication as fast as physically possible.

Can anyone clarify what "as physically possible" means here? Are there physical systems that perform matrix multiplication in O(n^2) ? (I feel silly even asking that about a physical, non-computation system)

I remember there were applications of optical computation, including a startup that wanted to accelerate deep network inference through optical computation: they had a way to do matrix multiplication using a projector (for the data matrix) and a 3D printed lens (for the network weights).

[+] jwmerrill|5 years ago|reply
I don’t think they’re actually intending to refer to physics here—they could have said “as fast as theoretically possible” instead.

A square matrix with n rows and columns has n^2 entries, so if you need to look at all the entries in two matrices to compute their product, then you need to do O(n^2) work just to look at all the relevant entries.

The only way you could get around this is if you know many entries are 0, or identical, or have some other structure such that you don’t need to look at them separately.

[+] sdenton4|5 years ago|reply
I don't think the sentence is very good.

You need n^2 time just to read the two matrices, so I always thought of that as a natural lower bound. (Of course, we're talking about multiplications eventually here, so may not be exactly what they meant...)

[+] klyrs|5 years ago|reply
An (n x n) square systolic array can perform matrix multiplication in time O(n) (not counting I/O, which takes O(n^2) without more specialty hardware) -- this is how TPUs work. Simulate the same algorithm on a single processor and it takes O(n^3). Use the same hardware to compute larger matrix products and it only gives a constant factor speedup.
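A rough single-processor simulation of that data flow (illustrative only; it models the timing but not the physical operand propagation between neighboring cells): cell (i, j) consumes operand pair k at step t = i + j + k, so the whole array finishes in 3n - 2 parallel steps, i.e. O(n) time on n^2 processing elements.

```python
import numpy as np

def systolic_matmul(A, B):
    n = A.shape[0]
    C = np.zeros((n, n))
    # Each iteration of t represents one parallel step of the array;
    # all cell updates within a step would happen simultaneously.
    for t in range(3 * n - 2):
        for i in range(n):
            for j in range(n):
                k = t - i - j
                if 0 <= k < n:
                    C[i, j] += A[i, k] * B[k, j]
    return C
```

Run sequentially, the same schedule is of course the plain O(n^3) algorithm again.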
[+] hansvm|5 years ago|reply
The other answers are good (it takes O(n^2) time just to read two matrices, so you can't possibly multiply them more quickly than that), but they leave out a critical part of that analysis that I want to highlight just in case it would have otherwise bitten you in the future:

You _do_ actually have to read all the entries in both matrices -- if you know K-1 out of K values then in general you still can't uniquely compute the output of matrix multiplication. Consequently, any algorithm relying on only some subset of the entries is at best an approximation.

If you have more information (e.g., that one or both of the matrices is sufficiently sparse) then you can potentially do better.
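A quick numpy illustration of that sensitivity: perturbing a single entry of A moves an entire row of the product, while every other row is untouched (row i of the result depends only on row i of A).

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.random((3, 3))
B = rng.random((3, 3))

A2 = A.copy()
A2[0, 0] += 1.0            # change just one input entry

diff = (A2 @ B) - (A @ B)  # all of row 0 of the product changes
print(diff)
```

So no exact algorithm can get away with skipping entries in general.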

[+] OneDayIGoHome|5 years ago|reply
I think if you could somehow start computing the resultant matrix elements as soon as you read a row/column from the input ones, you could reach their "physically possible" limit.

I'm only a couch expert on computer architecture, but could a large enough systolic array be used to achieve their "physically possible" limit? [0]

New CUDA GPUs have been coming with tensor cores, which are just systolic arrays. Google's TPUs are the same.

Could someone with more knowledge on systolic arrays comment on whether a large systolic array can achieve this?

[0] https://medium.com/intuitionmachine/googles-ai-processor-is-...

[+] slt2021|5 years ago|reply
O(N^2) is the time you need just to write down the correct answer, provided that it takes zero time to calculate each cell in the resulting matrix.
[+] aliceryhl|5 years ago|reply
It's badly phrased. What they mean is that we know that it will take at least n^2 operations, and that it would be nice to find an algorithm that is actually that fast.
[+] JulianChastain|5 years ago|reply
The article doesn't specify, and I don't have the prior knowledge to understand the paper: is there a coded implementation of this in some language, or is it purely a mathematical description of how multiplication can, in theory, always be done with slightly fewer multiplications?
[+] ableal|5 years ago|reply
Is anyone looking into the memory accesses for the operation?

Fifty years ago multiplication was vastly slower than RAM access. Nowadays it's practically free compared to even getting data out of cache ...

[+] MauranKilom|5 years ago|reply
Nobody is remotely looking at running these algorithms in the real world. These are very much asymptotic complexities, and to draw benefit from even the 1990 version you'd have to literally use galactic input sizes. None of this is about making the matrix multiplications on your PC faster.

https://en.wikipedia.org/wiki/Coppersmith%E2%80%93Winograd_a...

[+] joosters|5 years ago|reply
Shouldn't that first graph have the axes the other way round? Time (date) should be on the x-axis, and the measurement on the y-axis? As it is, it looks terribly confusing to me.
[+] CalChris|5 years ago|reply
If n^2 is possible then it would be possible for a given n, say 4. Can't it be shown by example or exhaustive search that this is or isn't possible for 4?
[+] dragontamer|5 years ago|reply
O(n^2) is a scaling factor: what happens as n->infinity.

An optimal arrangement for n==4 already exists. Both AMD MI100 and NVidia Volta / Ampere perform 4x4 FP16 matrix multiplication in a single assembly language statement.

An 8x8 matrix is just an arrangement of four 4x4 matrices (!!!!). As such, you can define an 8x8 matrix multiplication as a 4x4 matrix multiplication over 4x4 matrices. This recursive relationship is often key to how they manage to get faster and faster: getting to O(n^2.8), O(n^2.6), etc.
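That recursive relationship can be sketched in a few lines (illustrative Python; numpy's `@` stands in for a hardware 4x4 instruction at the base case, and sizes are assumed to be powers of two):

```python
import numpy as np

def block_matmul(A, B, base=4):
    """Recursive block multiplication: an n x n product is eight
    half-size products plus four half-size additions."""
    n = A.shape[0]
    if n <= base:
        return A @ B  # stand-in for a hardware 4x4 multiply instruction
    h = n // 2
    A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
    return np.block([
        [block_matmul(A11, B11) + block_matmul(A12, B21),
         block_matmul(A11, B12) + block_matmul(A12, B22)],
        [block_matmul(A21, B11) + block_matmul(A22, B21),
         block_matmul(A21, B12) + block_matmul(A22, B22)],
    ])
```

Replace the eight recursive calls with Strassen's seven and the exponent drops from 3 to ~2.807.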

[+] kadoban|5 years ago|reply
Since we're looking for a O(n^2) algorithm, not specifically n^2 operations, this actually doesn't help. You could for example need 1000n^2+1000000 operations, which would look terrible if you only look at n=4, but you're golden if n=10000000 (and even if there's no practical n where you win, it's still theoretically very interesting).
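To put made-up numbers on that (the constants here are purely hypothetical, chosen to match the example above):

```python
def fancy(n): return 1000 * n**2 + 1_000_000  # hypothetical fast algorithm, huge constants
def naive(n): return n**3                     # textbook operation count

for n in (4, 1000, 1001, 10_000):
    winner = "fancy" if fancy(n) < naive(n) else "naive"
    print(n, winner)
# naive wins up through n = 1000; fancy takes over from n = 1001 on
```

With these particular constants the crossover sits just past n = 1000; for the real galactic algorithms it sits far beyond any matrix you could store.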
[+] melling|5 years ago|reply
I am doing Andrew Ng’s ML Coursera course with Matlab.

I’ve now got the crazy desire to see matrices and vectors built into every language in a clean and succinct way.
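For reference, that's roughly what numpy bolts onto Python (Matlab and Julia build it into the language itself):

```python
import numpy as np

A = np.array([[1., 2.], [3., 4.]])
v = np.array([1., -1.])

print(A @ v)                 # matrix-vector product: [-1. -1.]
print(A @ A)                 # matrix-matrix product
print(A.T)                   # transpose
print(np.linalg.inv(A) @ A)  # ~identity matrix
```

It's bolted on rather than built in, which is part of why the syntax never feels quite as clean as Matlab's.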

[+] tasty_freeze|5 years ago|reply
One of the unexpected things is that Dartmouth BASIC from 1964 had matrix primitives, despite being an otherwise primitive language. MS never took up the feature, but I know Wang BASIC did on their 2200 family of machines.

For example:

    10 DIM A(3,3),B(3,3),C(3,3)
    20 MAT READ A
    30 DATA 1,-5,7, .45,23,9, -5,-1,0
    40 MAT B=A:REM assignment
    50 MAT B=ZER:REM zero array
    60 MAT B=CON:REM all 1s
    70 MAT B=IDN:REM identity
    80 MAT B=TRN(A):REM transpose
    90 MAT B=(2)*A:REM scale
    100 MAT B=INV(A),D:REM invert array, determinant in scalar D
    110 MAT C=A*B:REM matrix multiplication
    120 MAT PRINT C:REM prints something close to an IDN array
[+] montalbano|5 years ago|reply
Convenient matrix algebra is one of the selling points of Julia.
[+] taxemeEvasion|5 years ago|reply
It's a main reason why Fortran still has such a hold on the SciComp/HPC community.
[+] teruakohatu|5 years ago|reply
The base type in R is a vector. A number such as 3.14 is simply a vector of length 1. So functions are vectorized by default (as long as you don't do any operations that are explicitly not vectorizable).

Once you get your head around everything being a vector, R can be a lot of fun.

I love Julia, but I wish I didn't need to explicitly broadcast functions across a vector.

[+] make3|5 years ago|reply
Yes, have you tried Julia? It is pretty much how you describe. There are other things that I don't like about that language, but the matrix part is really great
[+] mkaic|5 years ago|reply
Just finished the course and couldn't agree more. Everything seemingly is faster with matrices!
[+] centimeter|5 years ago|reply
Beyond matrices and vectors, it would be nice to have generalized (typed) tensors. What you can do with matrices alone is really somewhat limited.
[+] fastball|5 years ago|reply
> no, no, no, no, no

Confirmed Strassen thinks the lower bound is n^2.32192809489.

[+] lsb|5 years ago|reply
log2(5)???
[+] dewiz|5 years ago|reply
How would one calculate the $$$ value of exp2 multiplication? Eg if one finds a way, would it be worth keeping it a secret and creating a business around it?
[+] dekhn|5 years ago|reply
In the time spent improving matrix multiplication by a tiny incremental amount, many gigawatt-hours have been spent making existing CPUs, GPUs, and TPUs do matrix multiplication with far less efficient algorithms. It's unclear whether this work would really matter in production computing (I understand the theoretical desire) compared to time spent making other things work better.
[+] The_rationalist|5 years ago|reply
Are there any non-galactic (practical) algorithms faster than Strassen (for non-small dense matrices)?
[+] foolinaround|5 years ago|reply
quanta magazine has been the best find for me this year!
[+] vessenes|5 years ago|reply
Man, Quanta magazine is a gift. I love that there is a publication with a significantly higher reading level assumed than most pop science content, but not written for industry insiders, it’s just great.

I pitched them a paper magazine a few years ago, because I wish I could leave these around for my kids to flip through, but no joy yet..