If you are interested in how those tensor methods work, the authors have another very nice paper that gives a great introduction: https://arxiv.org/abs/1810.08671
It is actually not that different from Strassen after all. It just uses a bit more notation to help construct more clever ways of saving multiplications.
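For anyone who wants the concrete starting point: Strassen's identity computes a 2x2 product with 7 multiplications instead of 8, at the cost of extra additions and subtractions. A minimal sketch in Python (the p1..p7 names are just labels for the seven products):

```python
def strassen_2x2(A, B):
    """Multiply two 2x2 matrices (lists of lists) with 7 multiplications."""
    (a, b), (c, d) = A
    (e, f), (g, h) = B
    p1 = a * (f - h)
    p2 = (a + b) * h
    p3 = (c + d) * e
    p4 = d * (g - e)
    p5 = (a + d) * (e + h)
    p6 = (b - d) * (g + h)
    p7 = (a - c) * (e + f)
    # 7 products, recombined with 18 additions/subtractions
    return [[p5 + p4 - p2 + p6, p1 + p2],
            [p3 + p4,           p1 + p5 - p3 - p7]]
```

Because the same identity applies to 2x2 blocks of larger matrices, applying it recursively gives O(n^log2(7)) ≈ O(n^2.807) overall.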
I’m more interested in why saving multiplications at the cost of more adds/subtracts helps overall.
On all modern CPUs their speed is the same for float numbers. Specifically, they have 4 cycles of latency on Skylake, 3 cycles of latency on Zen2, throughput (for 32-byte vectors) is 0.5 cycles for both CPUs.
On the off chance an expert reads this comment and feels like answering, is there a consensus opinion (or even just your opinion) on what the best result is likely to be? Do people think O(n^2) is actually possible?
Edit: shoulda read to the end, it says that Strassen says no.
The impression I had is that many experts believe that no single algorithm A can achieve O(n^2), but there exists a sequence of algorithms A_i that achieve O(n^{2+d_i}), with d_i>0 getting arbitrarily small, but with the algorithms getting arbitrarily long and the constant overhead factors out front getting arbitrarily large.
Unfortunately this was just word of mouth, and I might be misremembering.
Strassen has shown that the bound is higher than n^2 using a tensor rank bound. [1] The exact exponent bound is usually called omega in the literature, and there are good estimates for its value. It is also related to other quantities like the Grothendieck constant [2].
I'm no expert, but IMHO it feels like a problem that might have O(n^2 log(n)) time complexity. That would be more satisfying than some arbitrary non-integer exponent.
There are related problems in computer science -- e.g. how quickly can you multiply a pair of 65536x65536 extended double precision matrices together. Note that each matrix consumes ~68GB of memory, so you need (probably) at least 200GB to store the two input matrices and the output matrix. So far, so good.
Now you want to see how fast it can really go if you are prepared to spend (say) $100k on the hardware (e.g. 64 times as many CPUs with some fancy interconnect fabric). How about spending $100M? It turns out that moving all this data around is also quite costly.....
Does anyone know how matrix multiplications are implemented in CPUs, GPUs, or ML ASICs? Are there optimizations used, or do the optimizations take too much circuitry, so that the n^3 method is still used?
It's typically still the O(N^3) method that's implemented in OpenBLAS, BLIS, cuBLAS, MKL, etc. There are some high performance implementations of Strassen's Algorithm with a fixed level of recursion that start to perform much better as the matrix size gets large (see the work from UT Austin's BLIS). From my understanding for the more complex algorithms the hidden constant simply grows too large to be worth it on conventional hardware.
As another poster commented, these are O(N^3) in work, not necessarily the wall clock time you see due to parallel speedup and cache effects, but will scale as O(N^3) once you're at a size past all of these. These implementations are highly tuned and optimized for hardware, much much more so than the naive 3 loop implementation. The predictable and 'easy' access patterns make the simple algorithm hard to beat.
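For reference, this is the "naive 3 loop implementation" those libraries ultimately optimize -- a plain Python sketch, using lists of lists rather than any real BLAS data layout:

```python
def matmul_naive(A, B):
    """Textbook O(n^3) multiplication of two square n x n matrices."""
    n = len(A)
    C = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            s = 0
            for k in range(n):
                s += A[i][k] * B[k][j]  # n multiplications per output entry
            C[i][j] = s
    return C
```

The tuned library versions compute exactly this, but reorder and block the loops so the data being touched stays in registers and cache.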
Note that GPUs are almost always multiplying 4x4 matrices. Sure, they multiply many matrices together, but each is 4x4. The 4^3 complexity isn't at all an issue (64 steps), and the constant factor for some of the n^2.x methods is high.
For GPUs, it's actually much faster than O(n^3) in wall-clock time, because computing each entry in the result matrix is independent. Hence the problem is embarrassingly parallel, in a way.
I don't know how to use O() notation for GPUs, but it should be something like O(n^2/k^2) where k is the tile size [0].
Also, memory bandwidth becomes a bottleneck here, so there is a lot of optimization done on how to transfer data from CPU to GPU, and then within the GPU, to efficiently access the matrices.
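The tiling idea can be sketched on a single CPU thread -- this is not a real GPU kernel, just the access pattern: the output is processed in tile x tile blocks so each block's working set fits in fast memory (shared memory on a GPU, cache on a CPU). The tile size of 2 is purely illustrative.

```python
def matmul_tiled(A, B, tile=2):
    """Blocked O(n^3) matrix multiplication over tile x tile sub-blocks."""
    n = len(A)
    C = [[0] * n for _ in range(n)]
    for ii in range(0, n, tile):
        for jj in range(0, n, tile):
            for kk in range(0, n, tile):
                # accumulate one block-product; on a GPU each (ii, jj)
                # block would be handled by a separate thread block
                for i in range(ii, min(ii + tile, n)):
                    for j in range(jj, min(jj + tile, n)):
                        for k in range(kk, min(kk + tile, n)):
                            C[i][j] += A[i][k] * B[k][j]
    return C
```

Same arithmetic as the naive method, different traversal order -- which is exactly why the tuned O(n^3) routines win in practice.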
> However, the number of additions required to add those sets of matrices is much closer together. Generally, the number of additions is equal to the number of entries in the matrix, so four for the two-by-two matrices and 16 for the four-by-four matrices.
Maybe my linear algebra is REALLY rusty, but that doesn't seem right to me.
For two 4x4 matrices, each element in the output matrix will be the sum of four products. So each of the 16 output elements will require 3 additions, and you'll need a total of 48 additions.
I'm not sure where they're getting "16 additions" from.
To multiply two nxn matrices according to the textbook method, I think you'll need n^3 multiplications and n^3-n^2 additions.
I don't doubt that the multiplication time dominates the algorithm. Maybe this is just a nitpick.
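Those counts are easy to check by instrumenting the textbook method (a quick sketch; the counters just tally scalar operations rather than timing anything):

```python
def count_ops(n):
    """Count scalar multiplications and additions in textbook n x n matmul."""
    mults = adds = 0
    for i in range(n):
        for j in range(n):
            # each output entry: n products, and n - 1 additions to sum them
            mults += n
            adds += n - 1
    return mults, adds

# n^3 multiplications and n^3 - n^2 additions, as stated above:
for n in (2, 4, 8):
    assert count_ops(n) == (n**3, n**3 - n**2)
```

For n = 4 that gives 64 multiplications and 48 additions, matching the comment above.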
They're counting the additions needed to add the matrices, not to multiply them. The point being that matrix addition is order n^2, so even if you need to do a large number of matrix additions in some intermediate calculation, they don't hurt you as n gets large if you're trying to speed up matrix multiplication.
> “Exponent two” refers to the ideal speed — in terms of number of steps required — of performing one of the most fundamental operations in math: matrix multiplication. If exponent two is achievable, then it’s possible to carry out matrix multiplication as fast as physically possible.
Can anyone clarify what "as physically possible" means here? Are there physical systems that perform matrix multiplication in O(n^2) ? (I feel silly even asking that about a physical, non-computation system)
I remember there were applications of optical computation, including a startup that wanted to accelerate deep network inference through optical computation: they had a way to do matrix multiplication using a projector (for the data matrix) and a 3D printed lens (for the network weights).
I don’t think they’re actually intending to refer to physics here—they could have said “as fast as theoretically possible” instead.
A square matrix with n rows and columns has n^2 entries, so if you need to look at all the entries in two matrices to compute their product, then you need to do O(n^2) work just to look at all the relevant entries.
The only way you could get around this is if you know many entries are 0, or identical, or have some other structure such that you don’t need to look at them separately.
You need n^2 time just to read the two matrices, so I always thought of that as a natural lower bound. (Of course, we're talking about multiplications eventually here, so may not be exactly what they meant...)
An (n x n) square systolic array can perform matrix multiplication in time O(n) (not counting I/O, which takes O(n^2) without more specialty hardware) -- this is how TPUs work. Simulate the same algorithm on a single processor and it takes O(n^3). Use the same hardware to compute larger matrix products and it only gives a constant factor speedup.
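A toy cycle-level simulation of that idea, assuming an output-stationary array (one hypothetical layout among several): PE (i, j) accumulates C[i][j], row i of A streams in skewed by i cycles and column j of B skewed by j cycles. The `t` loop is the only "time"; the inner loops stand in for the n^2 PEs working in parallel each cycle.

```python
def systolic_matmul(A, B):
    """Simulate an n x n output-stationary systolic array.

    Finishes in 3n - 2 cycles, i.e. O(n) time on O(n^2) hardware;
    the same work simulated sequentially is the usual O(n^3).
    """
    n = len(A)
    C = [[0] * n for _ in range(n)]
    for t in range(3 * n - 2):          # wall-clock cycles
        for i in range(n):              # all PEs fire in parallel in hardware
            for j in range(n):
                k = t - i - j           # which A[i][k] * B[k][j] pair meets PE (i, j)
                if 0 <= k < n:
                    C[i][j] += A[i][k] * B[k][j]
    return C
```

Note the total multiply count is unchanged; only the wall-clock time drops, which is the parallel-speedup point made elsewhere in the thread.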
The other answers are good (it takes O(n^2) time just to read two matrices, so you can't possibly multiply them more quickly than that), but they leave out a critical part of that analysis that I want to highlight just in case it would have otherwise bitten you in the future:
You _do_ actually have to read all the entries in both matrices -- if you know K-1 out of K values then in general you still can't uniquely compute the output of matrix multiplication. Consequently, any algorithm relying on only some subset of the entries is at best an approximation.
If you have more information (e.g., that one or both of the matrices is sufficiently sparse) then you can potentially do better.
I think if you could somehow start computing the resultant matrix elements as soon as you read a row/column from the input ones, you could reach their "physically possible" limit.
I'm only a couch expert on computer architecture, but could a large enough systolic array be used to achieve their "physically possible" limit? [0]
New CUDA GPUs have been coming with these tensor cores that are just systolic arrays. Google's TPUs are the same.
Could someone with more knowledge on systolic arrays comment on whether a large systolic array can achieve this?
It's badly phrased. What they mean is that we know that it will take at least n^2 operations, and that it would be nice to find an algorithm that is actually that fast.
Related to a separate paper from a few weeks ago on using a different method for linear equation solving: https://www.quantamagazine.org/new-algorithm-breaks-speed-li... It seems likely that if you can use a different method for solving linear equations, some matrix multiplication can also be improved using a different method.
The article doesn't specify, and I don't have the prior knowledge to understand the paper; is there a coded implementation of this in some language, or is it purely a mathematical description of how, theoretically, multiplication can always be done with slightly fewer multiplications?
Strassen's algorithm is still the fastest to be widely implemented. The Coppersmith-Winograd variants that have achieved better asymptotic complexity are all galactic algorithms: https://en.wikipedia.org/wiki/Galactic_algorithm.
Nobody is remotely looking at running these algorithms in the real world. These are very much asymptotic complexities, and to draw benefit from even the 1990 version you'd have to literally use galactic input sizes. None of this is about making the matrix multiplications on your PC faster.
Shouldn't that first graph have the axes the other way round? Time (date) should be on the x-axis, and the measurement on the y-axis? As it is, it looks terribly confusing to me.
If n^2 is possible then it would be possible for a given n, say 4. Can't it be shown by example or exhaustive search that this is or isn't possible for 4?
O(n^2) is a scaling factor: what happens as n->infinity.
An optimal arrangement for n==4 already exists. Both AMD MI100 and NVidia Volta / Ampere perform 4x4 FP16 matrix multiplication in a single assembly language statement.
An 8x8 matrix is just an arrangement of four 4x4 matrices (!!!!). As such, you can define an 8x8 matrix multiplication as a 4x4 matrix multiplication over 4x4 matrices. This recursive relationship is often key to how they manage to get faster and faster, getting to O(n^2.8), O(n^2.6), etc.
Since we're looking for a O(n^2) algorithm, not specifically n^2 operations, this actually doesn't help. You could for example need 1000n^2+1000000 operations, which would look terrible if you only look at n=4, but you're golden if n=10000000 (and even if there's no practical n where you win, it's still theoretically very interesting).
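The constants point can be made concrete with a toy cost model (the 1000s here are just the commenter's hypothetical numbers, and the naive cost roughly counts n^3 multiplies plus n^3 adds):

```python
def cost_hypothetical(n):
    """Hypothetical O(n^2) algorithm with terrible constants."""
    return 1000 * n**2 + 1_000_000

def cost_naive(n):
    """Textbook method: roughly n^3 multiplies plus n^3 adds."""
    return 2 * n**3

# Hopeless at n = 4, a clear win by n = 1000:
assert cost_hypothetical(4) > cost_naive(4)
assert cost_hypothetical(1000) < cost_naive(1000)
```

So checking small cases like n = 4 exhaustively tells you nothing about whether an O(n^2) algorithm exists.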
One of the unexpected things is that Dartmouth BASIC from 1964 had matrix primitives, despite being an otherwise primitive language. MS never took up the feature, but I know Wang BASIC did on their 2200 family of machines.
For example:
10 DIM A(3,3),B(3,3),C(3,3)
20 MAT READ A
30 DATA 1,-5,7, .45,23,9, -5,-1,0
40 MAT B=A:REM assignment
50 MAT B=ZER:REM zero array
60 MAT B=CON:REM all 1s
70 MAT B=IDN:REM identity
80 MAT B=TRN(A):REM transpose
90 MAT B=(2)*A:REM scale
100 MAT B=INV(A),D:REM invert array, determinant in scalar D
110 MAT C=A*B:REM matrix multiplication
120 MAT PRINT C:REM prints something close to an IDN array
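For comparison, a rough present-day translation of that listing into Python/NumPy (not line-for-line; the comments map back to the BASIC statements):

```python
import numpy as np

A = np.array([[1, -5, 7], [0.45, 23, 9], [-5, -1, 0]])  # MAT READ A / DATA
B = A.copy()               # MAT B=A      (assignment)
B = np.zeros((3, 3))       # MAT B=ZER    (zero array)
B = np.ones((3, 3))        # MAT B=CON    (all 1s)
B = np.eye(3)              # MAT B=IDN    (identity)
B = A.T                    # MAT B=TRN(A) (transpose)
B = 2 * A                  # MAT B=(2)*A  (scale)
B = np.linalg.inv(A)       # MAT B=INV(A) (invert array...)
D = np.linalg.det(A)       # ...determinant in scalar D
C = A @ B                  # MAT C=A*B    (matrix multiplication)
print(C)                   # close to an IDN array, as in line 120
```

It is striking how closely the 1964 statement set anticipates the modern array-language vocabulary.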
The base type in R is a vector. A number such as 3.14 is simply a vector of length 1. So functions are vectorized by default (as long as you don't do any operations that are explicitly not vectorizable).
Once you get your head around everything being a vector, R can be a lot of fun.
I love Julia, but I wish I didn't need to explicitly broadcast functions across a vector.
Yes, have you tried Julia? It is pretty much how you describe. There are other things that I don't like about that language, but the matrix part is really great
How would one calculate the $$$ value of exp2 multiplication? E.g., if one finds a way, would it be worth keeping it a secret and creating a business around it?
In the time spent improving matrix multiplication by a tiny incremental amount, many gigawatt-hours have been spent making existing CPUs, GPUs, and TPUs do matrix multiplication using far less efficient algorithms. It's unclear whether this work would really be important in production computing (I understand the theoretical desire) compared to time spent making other things work better.
Man, Quanta magazine is a gift. I love that there is a publication with a significantly higher reading level assumed than most pop science content, but not written for industry insiders, it’s just great.
I pitched them a paper magazine a few years ago, because I wish I could leave these around for my kids to flip through, but no joy yet..
[1] http://www.thi.informatik.uni-frankfurt.de/~jukna/Strassen-V...
[2] https://arxiv.org/pdf/1711.04427.pdf
For perspective, that would be akin to multiplying every MAC address with every other MAC address upon the entire IPv4 range.
[0]https://docs.nvidia.com/deeplearning/performance/dl-performa...
[0] https://medium.com/intuitionmachine/googles-ai-processor-is-...
Fifty years ago multiplication was vastly slower than RAM access. Nowadays it's practically free compared to even getting data out of cache...
https://en.wikipedia.org/wiki/Coppersmith%E2%80%93Winograd_a...
I’ve now got the crazy desire to see matrices and vectors built into every language in a clean and succinct way.
Confirmed Strassen thinks the lower bound is n^2.32192809489.