almostdigital | 2 years ago

Here's the same benchmark with np.matmul instead of native python (on M2 MBP)

    Python             4.216 GFLOPS
    Naive:             6.400 GFLOPS            1.52x faster than Python
    Vectorized:       22.232 GFLOPS            5.27x faster than Python
    Parallelized:     52.591 GFLOPS           12.47x faster than Python
    Tiled:            60.888 GFLOPS           14.44x faster than Python
    Unrolled:         62.514 GFLOPS           14.83x faster than Python
    Accumulated:     506.209 GFLOPS          120.07x faster than Python
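The numpy side of a benchmark like this can be reproduced with a small timing sketch (the function name, sizes, and rep count here are my own, not the benchmark's exact harness):

```python
import time
import numpy as np

def matmul_gflops(n: int, reps: int = 10) -> float:
    """Time n x n float32 matmuls and report GFLOPS (2*n^3 flops per matmul)."""
    a = np.random.rand(n, n).astype(np.float32)
    b = np.random.rand(n, n).astype(np.float32)
    np.matmul(a, b)  # warm-up so one-time setup cost isn't measured
    start = time.perf_counter()
    for _ in range(reps):
        np.matmul(a, b)
    elapsed = time.perf_counter() - start
    return (2 * n**3 * reps) / elapsed / 1e9

print(f"n=128: {matmul_gflops(128):8.3f} GFLOPS")
```

Note that for small sizes like 128 the per-call dispatch overhead is a real part of what gets measured, which matters for the size comparison further down.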

microtonal|2 years ago

Does that use Apple Accelerate? Depending on the matrix size, that seems a bit low; even the M1 Pro can easily reach 2.2 TFLOPS.

gyrovagueGeist|2 years ago

What is your BLAS backend?
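For anyone wanting to check their own install: numpy can report which BLAS/LAPACK it was built against (the exact output layout varies between numpy versions):

```python
import numpy as np

# Prints the BLAS/LAPACK libraries numpy was linked against
# (e.g. openblas64 or Accelerate), plus compiler and SIMD details.
np.show_config()
```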

andy99|2 years ago

Yeah, this is confusing for me: I'm not an expert in numpy*, but I had assumed that it would do most of those things - vectorize, unroll, etc. - either when compiled or through whatever backend it's using. I understand that numpy's routines are fixed and that Mojo might have more flexibility, but for straight-up matrix multiplication I'd be very surprised if numpy is really leaving that much performance on the table. Although I can appreciate that if performance depends on which BLAS backend happens to be installed, that is a barrier to getting fast performance by default.

* For context, I do have some experience experimenting with the gcc/intel compiler options available for linear algebra, and even outside of BLAS, compiling with -O3 -ffast-math -funroll-loops etc. does a lot of that, and for simple loops as in matrix-vector multiplication, compilers can easily vectorize. I'm very curious if there is something I don't know about that would result in a speedup. See e.g. https://gist.github.com/rbitr/3b86154f78a0f0832e8bd171615236... for some basic playing around.

almostdigital|2 years ago

Just whatever you get by default with pip install numpy. Changing the benchmark to run with 1024x1024x1024 matrices instead of 128x128x128 does speed up numpy significantly though:

    Python           119.189 GFLOPS
    Naive:             6.275 GFLOPS            0.05x faster than Python
    Vectorized:       22.259 GFLOPS            0.19x faster than Python
    Parallelized:     50.258 GFLOPS            0.42x faster than Python
    Tiled:            59.692 GFLOPS            0.50x faster than Python
    Unrolled:         62.165 GFLOPS            0.52x faster than Python
    Accumulated:     565.240 GFLOPS            4.74x faster than Python
np.__config__:

    Build Dependencies:
      blas:
        detection method: pkgconfig
        found: true
        include directory: /opt/arm64-builds/include
        lib directory: /opt/arm64-builds/lib
        name: openblas64
        openblas configuration: USE_64BITINT=1 DYNAMIC_ARCH=1 DYNAMIC_OLDER= NO_CBLAS=
          NO_LAPACK= NO_LAPACKE= NO_AFFINITY=1 USE_OPENMP= SANDYBRIDGE MAX_THREADS=3
        pc file directory: /usr/local/lib/pkgconfig
        version: 0.3.23.dev
      lapack:
        detection method: internal
        found: true
        include directory: unknown
        lib directory: unknown
        name: dep4364960240
        openblas configuration: unknown
        pc file directory: unknown
        version: 1.26.1
    Compilers:
      c:
        commands: cc
        linker: ld64
        name: clang
        version: 14.0.0
      c++:
        commands: c++
        linker: ld64
        name: clang
        version: 14.0.0
      cython:
        commands: cython
        linker: cython
        name: cython
        version: 3.0.3
    Machine Information:
      build:
        cpu: aarch64
        endian: little
        family: aarch64
        system: darwin
      host:
        cpu: aarch64
        endian: little
        family: aarch64
        system: darwin
    Python Information:
      path: /private/var/folders/76/zy5ktkns50v6gt5g8r0sf6sc0000gn/T/cibw-run-27utctq_/cp310-macosx_arm64/build/venv/bin/python
      version: '3.10'
    SIMD Extensions:
      baseline:
      - NEON
      - NEON_FP16
      - NEON_VFPV4
      - ASIMD
      found:
      - ASIMDHP
      not found:
      - ASIMDFHM