I skimmed your post and I wonder if mojo is focusing on such small 512x512 matrices? What is your thinking on generalizing your results for larger matrices?
I think for a compiler it makes sense to focus on small matrix multiplies, which are a building block of larger matrix multiplies anyways. Small matrix multiplies emphasize the compiler/code generation quality. Even vanilla python overhead might be insignificant when gluing small-ish matrix multiplies together to do a big multiply.
dsharlet|2 years ago