top | item 39894316

marshallward | 1 year ago

There is an implication here that the Fortran implementation of `SGEMM` is somehow inadequate. But any modern Fortran compiler will quite easily apply the AVX and FMA optimizations presented here without any additional changes. Both GNU and Intel make these substitutions with the correct flags.

The unrolling optimization is also just another flag away (`-funroll-all-loops`). The Intel Compiler will even do this without prompting. In fact, it appears to only do a modest 2x unroll on my machine, suggesting that the extreme unroll in this article would have been overkill.

Parallelization is certainly a lot to ask of Fortran 77 source, but there is little stopping you from adding OpenMP directives to the `SGEMM` function. In fact, modern Fortran even offers its own parallel constructs if you're willing to go there.

Which is to say: Let's not belittle this old Fortran 77 function. Yes it is old, and does not even resemble modern Fortran. But the whole point of Fortran is to free the developer from these platform-specific details, and hand the job off to the compiler. If you don't like that approach, then you're welcome to go to C or C++. But this little block of Fortran code is already capable of doing just about everything in this article.

steppi | 1 year ago

The Fortran implementation is just that: a reference implementation. The goal of reference BLAS [0] is to provide relatively simple, easy-to-understand implementations that demonstrate the interface and are intended to give correct results to test against. Perhaps an exceptional Fortran compiler that doesn't yet exist could generate code rivaling hand-tuned (or auto-tuned) optimized BLAS libraries like OpenBLAS [1], MKL [2], ATLAS [3], and those based on BLIS [4], but in practice this is not observed.

Justine observed that the threading model for LLaMA makes it impractical to integrate one of these optimized BLAS libraries, so she wrote her own hand-tuned implementations following the same principles they use.

[0] https://en.wikipedia.org/wiki/Basic_Linear_Algebra_Subprogra...

[1] https://github.com/OpenMathLib/OpenBLAS

[2] https://www.intel.com/content/www/us/en/developer/tools/onea...

[3] https://en.wikipedia.org/wiki/Automatically_Tuned_Linear_Alg...

[4] https://en.wikipedia.org/wiki/BLIS_(software)

marshallward | 1 year ago

Fair enough, this is not meant as an endorsement of the standard Fortran BLAS implementation over the optimized versions cited above, only that the mainstream compilers I mentioned appear capable of applying these optimizations to the reference Fortran BLAS without any additional effort.

I am basing these comments on quick inspection of the assembly output. Timings would be equally interesting to compare at each stage, but I'm only willing to go so far for a Hacker News comment. So all I will say is perhaps let's keep an open mind about the capability of simple Fortran code.

brrrrrm | 1 year ago

Using AVX/FMA and unrolling loops gets you very little of the way toward fast (>80% of peak) GEMM code. These are introductory steps that don't account for many important ideas related to the cache hierarchy, uop interactions, and even instruction decode time. The Fortran reference implementation is entirely and unquestionably inadequate for real high-performance GEMMs.

marshallward | 1 year ago

I just compared Intel-compiled reference BLAS against OpenBLAS, and it was about 6 GFLOP/s vs 150 GFLOP/s, so I must admit that I was wrong here. Maybe in some sense 4% of peak is not bad, but it's certainly not good. My faith in current compilers has been shaken quite a bit today.

Anyway, I have come to eat crow. Thank you for your insight and helping me to get a much better perspective on this problem. I mostly work with scalar and vector updates, and do not work with matrices very often.

marshallward | 1 year ago

I don't disagree, but where are those techniques presented in the article? It seems like she exploits the particular shape of her matrix to align better with cache. No BLAS library is going to figure that out.

I am not trying to say that a simple 50+ year old matrix multiply routine is somehow competitive with existing BLAS libraries. But I disagreed with its portrayal in the article, which equated the block with NumPy performance. Give it to a 2024 Fortran compiler, and it will get enough right to produce reasonable machine code.

pklausler | 1 year ago

Modern Fortran's only parallel feature is coarrays, which operate at the whole program level.

DO CONCURRENT is a serial construct with an unspecified order of iterations, not a parallel construct. A DO CONCURRENT loop imposes requirements that allow an arbitrary order of iterations but which are not sufficient for safe parallelization.

marshallward | 1 year ago

How do you feel about Nvidia endorsing DO CONCURRENT offloading to GPUs? Would that count as parallelization?