
A Programmer’s Intuition for Matrix Multiplication

299 points | sebg | 5 years ago | betterexplained.com | reply

90 comments

[+] notsuoh|5 years ago|reply
This comes up occasionally. I don't find it to be particularly helpful, even though I do think betterexplained has been a strong source for gaining intuition about various subjects.

Recognizing, though, that one's method of understanding something is pretty personal, I will say that what really helped me refresh was the 3Blue1Brown Essence of Linear Algebra series on YouTube. If you're trying to better grasp the subject, do yourself a favor and check out those videos.

https://www.3blue1brown.com/essence-of-linear-algebra-page/

[+] Yajirobe|5 years ago|reply
Everyone's seen/knows them already
[+] hchasestevens|5 years ago|reply
I've always found the bipartite graph conceptualization of matrix multiplication the most intuitive, and especially so if you're familiar with neural networks: https://www.math3ma.com/blog/matrices-probability-graphs
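The graph reading of the product can be checked in a few lines. A numpy sketch with made-up weights: entry (i, k) of A @ B sums the edge weights over every two-step path i -> j -> k through the middle layer of the bipartite graph.

```python
import numpy as np

# Edge weights of a small layered bipartite graph (weights made up):
# A[i, j] = weight of edge from input node i to middle node j,
# B[j, k] = weight of edge from middle node j to output node k.
A = np.array([[1.0, 2.0],
              [0.0, 3.0]])
B = np.array([[4.0, 0.0],
              [5.0, 6.0]])

# (A @ B)[i, k] sums, over every middle node j, the product of the
# weights along the two-edge path i -> j -> k.
paths = np.zeros((2, 2))
for i in range(2):
    for k in range(2):
        paths[i, k] = sum(A[i, j] * B[j, k] for j in range(2))

assert np.allclose(paths, A @ B)
```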
[+] crdrost|5 years ago|reply
That is genuinely a really fun way to look at it, thank you for linking this! It fits especially nicely with Markov matrices, where you have N input nodes and N output nodes and the probabilities coming out of each node need to sum to 1.
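That Markov observation is easy to verify numerically. A numpy sketch, assuming the column-stochastic convention (each column holds the probabilities leaving one node): the product of two Markov matrices is again Markov.

```python
import numpy as np

# Column-stochastic (Markov) matrices: each column's entries are the
# probabilities of the edges leaving one node, so each column sums to 1.
# (The numbers here are made up.)
P = np.array([[0.9, 0.2],
              [0.1, 0.8]])
Q = np.array([[0.5, 0.3],
              [0.5, 0.7]])

# Composing two Markov steps is again a Markov step:
# the columns of Q @ P still sum to 1.
assert np.allclose((Q @ P).sum(axis=0), 1.0)
```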

What I might find a little more difficult to teach to people through this lens is the phenomenon of eigenvectors, but I suppose that's to be expected—nothing will work well for all purposes.

[+] ssivark|5 years ago|reply
Was just coming in to point to this. While the geometric intuition in the post being discussed is quite visual, thinking of matrix elements as directed arrows generalizes easily to high-dimensional vector spaces, and also makes it very easy to understand the behavior of sparse transformation matrices, and motivates a nice correspondence between linear algebra and graph algorithms (ref. graphblas)
[+] bee_rider|5 years ago|reply
This seems more accurate than the original post, which seems quite tied to this code/data analogy.

OTOH, it is just vectors in a many-dimensional space. We're nigh-supernatural machines for handling vectors; that's how we throw things better than any other animal. I don't really know why we need analogies here. Why do we want x'x? It maps to a distance...

[+] tsjq|5 years ago|reply
Thanks for linking to this.
[+] activatedgeek|5 years ago|reply
I'd prefer stating that matrices don't actually mean anything in isolation. They are simply representations of linear maps. Where the elements of a matrix came from before the matrix was formed is irrelevant to the matrix itself.

..And to be clear, I am not a purist.

[+] WJW|5 years ago|reply
Purist test: aren't vectors just Nx1 matrices?

Joking aside, matrices are linear maps in the context of multiplication. They can absolutely have meanings assigned outside the usual "when multiplied by a vector" type of calculation. For example, the "adjacency matrix" representation of a graph often is never multiplied by a vector, even in academic algorithm descriptions. YMMV on whether you'd call it a (lookup) table in that context, but the term "matrix" seems fairly well established there.

[+] d_tr|5 years ago|reply
Agreed. I think that the distinction between the objects and their possible representations should be made clear from the beginning, simply because it will make things easier to understand.

IMO, just a couple of chapters from an introductory pure mathematics book is actually the most "to the point" place to start even for people who are not huge fans of math. Just the precise definition of a vector space over a field, a linear operator, a vector basis, the derivation of matrix elements, the usual related examples and some "gymnastics" on paper can make stuff "click" and be a great start for any intended application.

[+] siraben|5 years ago|reply
When I took a course in linear algebra, we learned about linear transformations and vector spaces in the abstract first. Then when we got to matrices, they were presented as a way to represent linear transformations. This also led to a natural derivation of matrix multiplication.
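That derivation can be seen in a few lines of numpy: the entry formula (AB)[i, k] = sum_j A[i, j] * B[j, k] is exactly what is forced if the product matrix is to represent the composition "apply B, then A". Random matrices here are just for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4))   # a linear map R^4 -> R^3
B = rng.standard_normal((4, 5))   # a linear map R^5 -> R^4
x = rng.standard_normal(5)

# Applying B, then A, to a vector...
y = A @ (B @ x)
# ...is the same as applying the single matrix A @ B: composition of
# linear maps is what forces the usual multiplication rule.
assert np.allclose(y, (A @ B) @ x)
```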
[+] da39a3ee|5 years ago|reply
> However, sometimes the matrix being operated on is not a linear operation, but a set of vectors or data points.

In Jeremy Kun's book [1] he argues that in fact data matrices _can_ be viewed as linear transformations. See e.g. section 10.9 on the SVD

> That is, we’re saying the input is R3 and the basis vectors are people:...

> By representing the ratings this way, we’re imposing the hypothesis that the _process_ of rating movies is linear in nature. ...

[1] https://pimbook.org/

[+] justjonathan|5 years ago|reply
I wholeheartedly endorse 3B1B. The linear algebra series is excellent, and the rest of his work is too!
[+] shoo|5 years ago|reply
In some applications, e.g. modelling something defined by a PDE, you might need to solve a very large linear system of equations Ax=b, where the vector b has 100,000,000 elements and A is a 100,000,000 by 100,000,000 matrix. For a naive dense matrix representation of A using 1 byte per entry, you'd need around 9 petabytes of storage.

In practice, for many interesting systems of equations that encode "local" properties, A will be a sparse matrix where the overwhelming majority of matrix entries are zero. There are various sparse matrix formats designed to avoid storing zeroes: the simplest is COO, where you store triples of the form (i, j, A_ij) for each nonzero entry A_ij of A. More efficient formats are CSR and CSC (compressed sparse row and column, respectively), which use even less memory by avoiding storing repeated indices where there are runs of nonzero elements.
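The COO idea fits in a few lines. A toy sketch (the matrix entries are made up; scipy.sparse.coo_matrix is the production version of this format): store only the (row, col, value) triples, and compute A @ x without ever materializing the dense matrix.

```python
import numpy as np

# COO ("coordinate") sparse storage: one (row, col, value) triple per
# nonzero entry of a 3x3 matrix (entries made up for illustration).
rows = np.array([0, 1, 2, 2])
cols = np.array([0, 1, 0, 2])
vals = np.array([4.0, 5.0, -1.0, 2.0])

def coo_matvec(rows, cols, vals, x):
    """Compute A @ x from the triples, never building the dense A.
    (Assumes a square matrix, for brevity.)"""
    y = np.zeros(x.shape[0])
    for i, j, v in zip(rows, cols, vals):
        y[i] += v * x[j]
    return y

# Check against the dense version on a small example.
x = np.array([1.0, 2.0, 3.0])
dense = np.zeros((3, 3))
dense[rows, cols] = vals
assert np.allclose(coo_matvec(rows, cols, vals, x), dense @ x)
```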

To solve one of these huge linear systems Ax = b , we hardly ever want [1] to go to the expense of computing and storing an explicit matrix representation of the inverse A^{-1} . Instead, we merely need to have some function f that is able to map an input vector b to some output vector f(b)=x that happens to satisfy Ax=b .

If the matrix A has sufficiently nice properties [2], we can use an efficient iterative method such as the conjugate gradient method to map the input vector b to an output vector x by solving Ax=b. To run the conjugate gradient method, we don't necessarily need an explicit dense or sparse representation of the matrix A; we merely need a function that encodes how A acts on a vector by matrix multiplication, e.g. some function g that maps an input vector x to the output vector g(x) = Ax.

So if we've got some appropriately defined conjugate gradient method algorithm cg(g, b), we can define the function f(b) := cg(g, b)

Calling f(b) will return x that satisfies g(x) = b, i.e. f is the inverse of g. So now we're expressing linear algebra without having any explicit matrices anywhere: but our functions g and f are of course both linear operators that encode the action of matrix multiplication by A and A^{-1} respectively.
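A minimal matrix-free sketch of this setup in numpy (a textbook conjugate gradient loop, not a production solver: no preconditioning, and the example operator g is the 1-D discrete Laplacian, chosen because it is symmetric positive definite):

```python
import numpy as np

def cg(g, b, tol=1e-10, maxiter=1000):
    """Conjugate gradients: solve g(x) = b using only the action of g.
    A never has to exist as an explicit array."""
    x = np.zeros_like(b)
    r = b - g(x)            # residual
    p = r.copy()            # search direction
    rs = r @ r
    for _ in range(maxiter):
        Ap = g(p)
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

# An SPD "matrix" given only by its action: the tridiagonal 1-D
# Laplacian (2 on the diagonal, -1 off it), encoded as a function.
def g(x):
    y = 2.0 * x
    y[:-1] -= x[1:]
    y[1:] -= x[:-1]
    return y

b = np.ones(50)
x = cg(g, b)            # f(b) := cg(g, b) inverts the action of g
assert np.allclose(g(x), b, atol=1e-8)
```

scipy.sparse.linalg exposes the same pattern via LinearOperator, for when you want a battle-tested solver instead of this sketch.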

[1] Why not? It's a lot more effort to compute A^{-1}, which could then be used to evaluate A^{-1}b for any given b, but often we only want to know A^{-1}b for one specific b. Also, if the matrix A encodes some local property of a physical system, then A^{-1} will encode a global property -- so it will need to be a dense matrix.

[2] Sufficiently nice properties: e.g. if it's symmetric positive definite, which arises naturally in a bunch of situations. See https://en.m.wikipedia.org/wiki/Conjugate_gradient_method

[+] activatedgeek|5 years ago|reply
For the more machine learning-minded folks, this happens almost always in doing inference with Exact Gaussian Processes (GP), where because of the non-parametric nature of a GP model, the covariance matrix grows with the size of data points. The inference routine is cubic in the number of data points. Hence, sparsity in the posited covariance matrix is _extremely_ important for fast inference.
[+] unnouinceput|5 years ago|reply
And what happens if the huge matrix's elements are indeed all non-zero? Like, let's say, satellite scan data for a given country when you want to spy on their underground systems (think North Korea facilities)? Wouldn't storing that data as COO actually triple the amount of memory?
[+] zests|5 years ago|reply
Matrices are a fun way to stretch the mind. They are easy to grasp yet very dissimilar to numbers. Sometimes they can be used to represent numbers and we can use our matrix intuition to answer questions about regular numbers.

You can define functions on matrices just like you define them on numbers. Multiplication is a function. The determinant is a function. You can even take derivatives of these functions. The derivative of the determinant function is written in terms of the trace of the matrix. Yet another matrix function. Who would have guessed that the derivative and trace are related in this way? There's really a lot of depth here.

https://en.wikipedia.org/wiki/Jacobi%27s_formula
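Jacobi's formula is easy to sanity-check numerically. A numpy sketch (random matrices just for illustration): the derivative of t -> det(A + tB) at t = 0 should equal tr(adj(A) B), where adj(A) = det(A) * inv(A) for invertible A.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((4, 4))   # invertible with probability 1
B = rng.standard_normal((4, 4))

# Jacobi's formula: d/dt det(A + tB) |_{t=0} = tr(adj(A) @ B).
adjA = np.linalg.det(A) * np.linalg.inv(A)
jacobi = np.trace(adjA @ B)

# Compare against a central finite difference of the determinant.
h = 1e-6
finite_diff = (np.linalg.det(A + h * B) - np.linalg.det(A - h * B)) / (2 * h)

assert np.isclose(jacobi, finite_diff, rtol=1e-4, atol=1e-6)
```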

[+] OneGuy123|5 years ago|reply
If we're talking about 3D rendering, then multiplying a vector by a matrix is just projection: a 3x3 matrix transforming a vector is nothing other than that vector being projected onto the 3 axes stored inside the 3x3 matrix. This is very easy to see visually.

A matrix-times-matrix multiplication (e.g. 3x3 times 3x3) is just projecting the axes of one matrix onto the other: expressing one coordinate system in terms of a different coordinate system.

The 4x3 (or 4x4) matrix is needed for translation and for projecting a 3D triangle onto a 2D plane (screen pixels).

For translation it's just a hack, but if you write it out, it's a nice and fast way to simulate "movement".

Then for projection you just hack the numbers in the matrix in such a way that things farther from the origin are shrunk.
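The translation hack is worth writing out once. A numpy sketch, assuming the usual homogeneous-coordinate convention (append w = 1 to every point): a 4x4 matrix can then add a constant offset, which no 3x3 matrix can do.

```python
import numpy as np

def translation(tx, ty, tz):
    """4x4 homogeneous translation matrix: the offset sits in the
    last column and gets multiplied by the point's w = 1."""
    T = np.eye(4)
    T[:3, 3] = [tx, ty, tz]
    return T

p = np.array([1.0, 2.0, 3.0, 1.0])     # point (1, 2, 3) with w = 1
q = translation(10, 0, -5) @ p         # "movement" via multiplication
assert np.allclose(q[:3], [11.0, 2.0, -2.0])
```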

TLDR: for 3D graphics, matrices are used because you can hack them to represent any kind of transformation, and because graphics cards are fast at performing dot products. Plus, matrix multiplication doesn't require trigonometry or division, so it avoids the slower transcendental GPU instructions (sin, cos, ...).

For other uses their interpretation is different.

I always hated the dull mathematical definition "a matrix represents a linear map", because what a matrix actually represents is completely context dependent.

[+] markisus|5 years ago|reply
The projection viewpoint for a rotation matrix using dot products is totally valid, but I think the column viewpoint is more intuitive.

A 3x3 matrix can be viewed as something that acts on the standard unit basis vectors. The action on the standard unit basis vectors can be read out by looking at the columns. (A matrix times the ith standard basis vector is just the ith column.) Since the transform is linear, all the other points in space are carried along with the action as well. This is the "linear map" viewpoint of a rotation matrix.
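A two-line numpy check of that claim, using a 90-degree rotation about the z-axis as the example matrix: multiplying by the i-th standard basis vector picks out the i-th column.

```python
import numpy as np

# Rotation by 90 degrees about the z-axis. Reading the columns:
# x-axis -> (0, 1, 0), y-axis -> (-1, 0, 0), z-axis stays put.
R = np.array([[0.0, -1.0, 0.0],
              [1.0,  0.0, 0.0],
              [0.0,  0.0, 1.0]])

for i in range(3):
    e = np.zeros(3)
    e[i] = 1.0
    assert np.allclose(R @ e, R[:, i])   # R e_i is exactly column i
```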

I show an example here https://biro.ai/we-learned-the-wrong-way-to-matrix-multiply/ where this viewpoint lets you instantly see what a rotation matrix does by looking at the columns.

[+] d_tr|5 years ago|reply
> I always hated the dull mathematic definition of "a matrix represents a linear map" because what the matrix actually represents is completely context dependent.

The projection and translation matrices you mentioned represent restricted sets of linear maps in some vector space, but the meaning of the restriction is also captured in other mathematical structures. See affine spaces and projective spaces.

[+] Koshkin|5 years ago|reply
Sure - remember those noisy dot matrix printers?..
[+] Waterluvian|5 years ago|reply
Does anyone mind suggesting a solid intro to matrices in the context of what they are and how the affine transformations work? I know of these concepts but I think it's time to learn.
[+] Exuma|5 years ago|reply
OMG Thank you! I'm going through tons of linear algebra for a ML algorithm I need to understand and matrix multiplication is ... weird to say the least.
[+] adamnemecek|5 years ago|reply
Matrix multiplication is like convolution.
[+] activatedgeek|5 years ago|reply
To be sure there isn't a perspective here that I may be unaware of, is this a reference to circulant matrices, which bear the Toeplitz sparsity structure?
[+] salty_biscuits|5 years ago|reply
Maybe convolution on a discrete basis can be represented as a matrix multiplication with a circulant matrix?
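That correspondence can be checked directly. A numpy sketch (kernel and signal values made up): building the circulant matrix whose columns are cyclic shifts of the kernel, multiplying by it reproduces circular (wrap-around) convolution.

```python
import numpy as np

# A kernel k and the circulant matrix C whose i-th column is k
# cyclically shifted down by i, so C[i, j] = k[(i - j) % n].
k = np.array([1.0, 2.0, 0.0, -1.0])
n = len(k)
C = np.column_stack([np.roll(k, i) for i in range(n)])

x = np.array([3.0, 0.0, 1.0, 2.0])

# Circular convolution computed from the definition...
circ_conv = np.array([sum(k[(i - j) % n] * x[j] for j in range(n))
                      for i in range(n)])
# ...matches multiplication by the circulant matrix.
assert np.allclose(C @ x, circ_conv)
```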
[+] julian_adams|5 years ago|reply
Can you write up a physical intuition for the usage of matrices?
[+] Koshkin|5 years ago|reply
A rotation in space by an angle can be written as a matrix.
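For instance, rotation in the plane by an angle theta, as a 2x2 matrix (a numpy sketch):

```python
import numpy as np

def rotation(theta):
    """2x2 matrix that rotates the plane counterclockwise by theta."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s],
                     [s,  c]])

# Rotating (1, 0) by 90 degrees lands on (0, 1).
assert np.allclose(rotation(np.pi / 2) @ [1.0, 0.0], [0.0, 1.0], atol=1e-12)
```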
[+] Tycho|5 years ago|reply
What is some good intuition around matrix inverses?
[+] joppy|5 years ago|reply
The matrix inverse “undoes whatever the matrix did” is pretty much the most straightforward interpretation, if you’re thinking of a matrix as performing some kind of rotation/shear/scaling of space.

There are some other interpretations. For example, if you think of a linear system Ax=b as a kind of "checking machine" (does x satisfy Ax=b? yes/no, try again), then the inverse of A can produce solutions directly, since x = A^{-1}b.
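Both readings fit in a short numpy sketch: applying the inverse after the matrix returns the original vector, and for a rotation the inverse is simply the transpose.

```python
import numpy as np

# Rotate a vector by 30 degrees, then "undo" it with the inverse.
theta = np.pi / 6
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
v = np.array([2.0, -1.0])

assert np.allclose(np.linalg.inv(R) @ (R @ v), v)   # inverse undoes R
assert np.allclose(np.linalg.inv(R), R.T)           # rotations: inverse = transpose
```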