
Intel Distribution for Python

139 points | EntICOnc | 4 years ago | software.intel.com

82 comments

[+] Rd6n6|4 years ago|reply
> the Intel CPU dispatcher does not only check which instruction set is supported by the CPU, it also checks the vendor ID string. If the vendor string says "GenuineIntel" then it uses the optimal code path. If the CPU is not from Intel then, in most cases, it will run the slowest possible version of the code, even if the CPU is fully compatible with a better version.[1]

I’ve been a little shy about using Intel software since reading about this years ago.

[1] https://www.agner.org/optimize/blog/read.php?i=49
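To make the linked complaint concrete, here is a minimal sketch of the two dispatch strategies, assuming Linux's /proc/cpuinfo; the function names are illustrative, not Intel's actual code:

```python
# Sketch of feature-based vs vendor-gated CPU dispatch, as described in the
# linked Agner Fog post. Function names here are illustrative only.

def read_cpuinfo():
    """Parse the vendor string and feature flags from /proc/cpuinfo (Linux)."""
    vendor, flags = "", set()
    with open("/proc/cpuinfo") as f:
        for line in f:
            if line.startswith("vendor_id"):
                vendor = line.split(":", 1)[1].strip()
            elif line.startswith("flags"):
                flags = set(line.split(":", 1)[1].split())
    return vendor, flags

def pick_kernel_feature_based(flags):
    # What you would hope a dispatcher does: key off capability bits only.
    if "avx2" in flags:
        return "avx2_kernel"
    if "sse4_2" in flags:
        return "sse42_kernel"
    return "generic_kernel"

def pick_kernel_vendor_gated(vendor, flags):
    # What the post reports: the fast paths are reserved for GenuineIntel.
    if vendor == "GenuineIntel":
        return pick_kernel_feature_based(flags)
    return "generic_kernel"  # slowest path, even if AVX2 is supported

# An AMD Zen CPU reports "AuthenticAMD" but does support AVX2:
print(pick_kernel_feature_based({"avx2", "sse4_2"}))       # avx2_kernel
print(pick_kernel_vendor_gated("AuthenticAMD", {"avx2"}))  # generic_kernel
```

The difference between the two `pick_kernel` functions is exactly the gap the benchmarks further down this thread measure.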

[+] microtonal|4 years ago|reply
This has gotten a bit better. Last time I checked, MKL now uses Zen-specific kernels for sgemm/dgemm. Unfortunately, these kernels are slower than the AVX2 kernels. But at least it no longer uses the pre-modern SIMD kernels on AMD Zen.

Edit, comparison:

    $ perf record target/release/gemm-benchmark  -d 1024
    Threads: 1
    Iterations per thread: 1000
    Matrix shape: 1024 x 1024
    GFLOPS/s: 96.36
    $ perf report --stdio -q | head -n3
        97.18%  gemm-benchmark  gemm-benchmark      [.] mkl_blas_def_sgemm_kernel_0_zen
         1.94%  gemm-benchmark  gemm-benchmark      [.] mkl_blas_def_sgemm_scopy_down16_bdz
         0.78%  gemm-benchmark  gemm-benchmark      [.] mkl_blas_def_sgemm_scopy_right4_bdz
After disabling Intel CPU detection:

    $ perf record target/release/gemm-benchmark  -d 1024
    Threads: 1
    Iterations per thread: 1000
    Matrix shape: 1024 x 1024
    GFLOPS/s: 129.12
    $ perf report --stdio -q | head -n3
        97.02%  gemm-benchmark  libmkl_avx2.so.1        [.] mkl_blas_avx2_sgemm_kernel_0
         1.77%  gemm-benchmark  libmkl_avx2.so.1        [.] mkl_blas_avx2_sgemm_scopy_down24_ea
         1.02%  gemm-benchmark  libmkl_avx2.so.1        [.] mkl_blas_avx2_sgemm_scopy_right4_ea
Benchmarked using https://github.com/danieldk/gemm-benchmark and oneMKL 2021.3.0.
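A rough NumPy analogue of the Rust benchmark above, for anyone who wants to sanity-check their own BLAS build without extra tooling (this is a sketch; absolute numbers depend on the CPU and the linked BLAS):

```python
# Time square single-precision matmuls and report GFLOPS/s, mirroring the
# gemm-benchmark numbers quoted above. NumPy delegates `@` to the BLAS sgemm.
import time
import numpy as np

def sgemm_gflops(n=1024, iters=10):
    a = np.random.rand(n, n).astype(np.float32)
    b = np.random.rand(n, n).astype(np.float32)
    a @ b  # warm-up so lazy initialization doesn't skew the timing
    start = time.perf_counter()
    for _ in range(iters):
        a @ b
    elapsed = time.perf_counter() - start
    # One n x n x n matmul costs 2*n**3 floating-point operations.
    return 2 * n**3 * iters / elapsed / 1e9

print(f"GFLOPS/s: {sgemm_gflops():.2f}")
```

Running this under `perf record`, as in the comment above, shows which kernel symbol the BLAS actually dispatched to.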
[+] SavantIdiot|4 years ago|reply
That's just plain sinister.

I'm really surprised popular numerical computing Python packages like NumPy don't already have optimized hardware back-ends... similar to ORC (OIL), which has been around for quite some time:

https://github.com/GStreamer/orc

But I don't know that much about Python under the hood, and I'm willing to bet, since so many academics work on this, that there are already optimized FFIs. I've used TensorFlow and it can offload tensor math to GPUs, but only NVIDIA's AFAIK.

[+] mushufasa|4 years ago|reply
There is a longstanding issue around MKL and OpenBLAS optimization flags making Intel systems artificially faster than AMD ones for NumPy computations. https://stackoverflow.com/questions/62783262/why-is-numpy-wi...

If there are true optimizations to be had, wonderful. But those should be added to the core binaries on PyPI / conda. I am worried that Intel may again be trying to artificially segment the optimization work on their math libraries for business rather than technical reasons.

[+] gnufx|4 years ago|reply
At least single-threaded "large" OpenBLAS GEMM has always been similar to MKL once the micro-architecture is covered. If there's some problem with the threaded version (which one?), has it been reported, as it would be for use in Julia? Anyway, on AMD, why wouldn't you use AMD's BLAS (just a version of BLIS)? That tends to do well multi-threaded, though I'm normally only interested in single-threaded performance. I don't understand why people are so obsessed with MKL, especially when they don't measure and understand the measurements.
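Before comparing any of these libraries, it's worth confirming which BLAS a given NumPy build actually links. A quick check, assuming a reasonably recent NumPy (the exact output format varies between versions and builds):

```python
# Capture NumPy's build configuration and look for the BLAS vendor in it.
import io
import contextlib
import numpy as np

buf = io.StringIO()
with contextlib.redirect_stdout(buf):
    np.show_config()  # prints the BLAS/LAPACK link information
config = buf.getvalue().lower()

for backend in ("mkl", "openblas", "blis"):
    if backend in config:
        print(f"NumPy appears to be linked against {backend}")
```

A surprising number of "MKL vs OpenBLAS" comparisons turn out to be comparing the same library against itself.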
[+] thunkshift1|4 years ago|reply
What do you mean by ‘artificially faster’?
[+] pletnes|4 years ago|reply
From a practical perspective you have to use some BLAS library. If there is a working alternative from AMD, it would be great if you shared it. They did have one in the past, although I don't recall its name.
[+] dsign|4 years ago|reply
Thanks for bringing up that link; I'd had that nagging question about how specific Intel's performance libraries are to Intel hardware. At least in this case, it seems not much.
[+] jxy|4 years ago|reply
That SO performance benchmark would be so much more useful if the OP had also run OpenBLAS on the Xeon.
[+] mistrial9|4 years ago|reply
what, no Debian/Ubuntu ? sigh
[+] mhh__|4 years ago|reply
Do AMD even have optimized packages available? Don't get me wrong, I'm not a huge fan of what Intel get up to but AMD's profiling software is dreadful so I'm not exactly surprised that Intel don't even entertain the option.
[+] bananaquant|4 years ago|reply
Quite unsurprisingly, this distribution has no support for ARM: https://software.intel.com/content/www/us/en/develop/article...

I once was excited about Intel releasing their own Linux distro (Clear Linux), but it has the same problem. It looks like Intel is trying to make custom optimized versions of popular open-source projects just to get people to use their CPUs, as they lose their leadership in hardware.

[+] mumblemumble|4 years ago|reply
I'm not sure I see why you would expect anything different? The entire point of this framework is to provide a bunch of tools for squeezing the most you can out of SSE, which is specific to x86.

I don't know if there's an ARM-specific equivalent, but, if you want to use TensorFlow or PyTorch or whatever on ARM, they'll work quite happily with the Free Software implementations of BLAS & friends. If you code at an appropriately high level, the nice thing about these libraries is that you get to have vendor-specific optimizations without having to code against vendor-specific APIs. Which is great. I sincerely wish I had that for the vector-optimized code I was writing 20 years ago. In any case, if ARM Holdings or a licensee wants to code up their own optimized libraries that speak the same standard APIs (and assuming they haven't already), that would be awesome, too. The more the merrier. How about we all get in on the vendor-optimized libraries for standard APIs bandwagon. Who doesn't want all the vendor-specific optimizations without all the vendor lock-in?

Alternatively, if you would rather get really good and locked in to a specific vendor, you could opt instead to spam the CUDA button. That's a popular (and, as far as I'm concerned, valid, if not necessarily suited to my personal taste) option, too.

[+] smoldesu|4 years ago|reply
"Their" CPUs meaning x86 platforms, in this case.

Plus, who's surprised? This is how Intel makes money. The consumer segment is a plaything for them, the real high-rollers are in the server segment, where they butter them up with fancy technology and the finest digital linens. Is it dumb? A little, but it's hardly a "problem" unless you intended to ship this software on first-party hardware which, hint-hint, the license forbids in the first place.

At the end of the day, this doesn't really irk me. I can buy a compatible processor for less than $50, that's accessible enough.

[+] gnufx|4 years ago|reply
Clear Linux looked unconvincing to me. When I looked at their write-up, the example of what they say they do with vectorization was FFTW. That depends on hand-coded machine-specific stuff for speed, and the example was actually for the testing harness, i.e. quite irrelevant. I did actually run the patching script for amusement.
[+] mhh__|4 years ago|reply
Alder Lake looks seriously impressive if the rumoured performance is even close to accurate, so I wouldn't count them out just yet - that being said, they will never get a run like they did over the last 10 years again.
[+] vitorsr|4 years ago|reply
You can easily try it yourself [1]:

    conda create -n intel -c intel intel::intelpython3_core
Or [2]:

    docker pull intelpython/intelpython3_core
Note that it is quite bloated but includes many high-quality libraries.

You can think of it as a recompilation in addition to a collection of patches to make use of their proprietary libraries.

Other useful links to reduce the noise in this thread: [3], [4], [5], [6].

[1] https://software.intel.com/content/www/us/en/develop/article...

[2] https://software.intel.com/content/www/us/en/develop/article...

[3] https://www.nersc.gov/assets/Uploads/IntelPython-NERSC.pdf

[4] https://hub.docker.com/u/intelpython

[5] https://anaconda.org/intel

[6] https://github.com/IntelPython

[+] tkinom|4 years ago|reply
Any benchmarks comparison data?

   For example: ... benchmark with this Python is XXX% faster than ... (stock Python, AMD, ARM)
[+] _joel|4 years ago|reply
Why are they making their own distro and not putting code back into mainline if it's useful? Do they have some particular IP that makes this impossible?
[+] LeifCarrotson|4 years ago|reply
Here's the list of CPUs which incorporate the AVX2 instructions that enable some of these optimizations:

https://en.wikipedia.org/wiki/Advanced_Vector_Extensions#CPU...

You could write your distro to check the flags in /proc/cpuinfo, which will tell you whether these instructions are available. Or you could check whether the CPU is in the Intel half of the list or the AMD half. Or you could write your own distro that only runs on the first half of the list.

I get that Intel's contributions aren't purely altruistic. There are likely to be subtle tuning problems that require slight changes to optimize on different platforms, and they can't really be expected to do free work for AMD. But it looks to me like they're being unnecessarily anticompetitive.

[+] SkipperCat|4 years ago|reply
I think there is a pretty big base of people who do big data work using Numpy and Pandas (Fintech, etc). They want to squeeze every bit of computing power out of the specific Intel chipset, GPUs, etc and Intel's distro really helps them out.

A 10% speed improvement on thousands of jobs could in theory save you a nice chunk of time. This becomes very important in the financial markets, where you need batch jobs to finish before markets open, or when you just want to save 10% on your EC2 bill.

[+] TOMDM|4 years ago|reply
To me this just looks like Intel saw what Nvidia has accomplished with CUDA, locking in large portions of the scientific computing community with a hardware specific API and going "yeah me too thanks"

Thankfully, accelerated math libraries already exist for Python without the vendor lockin.

[+] bostonsre|4 years ago|reply
Intel has been releasing mkl/math kernel libraries for Java for a really long time. Hopefully core python devs can learn a few tricks and similar changes can make it upstream.
[+] rshm|4 years ago|reply
Looks like recompilation. I am guessing the gains are on NumPy and SciPy. For a Python-heavy code base, I doubt it can be more performant than PyPy.
[+] ciupicri|4 years ago|reply
Python 3.7.4, when 3.10 is just around the corner.
[+] amelius|4 years ago|reply
Maybe I'm missing something but it seems to me that this can only cause fragmentation in the Python space.

Why not use the original distributions?

[+] lbhdc|4 years ago|reply
There are a number of alternate interpreters available. The selling point is typically that they are faster, and that seems to be the value proposition of Intel's too.

One use might be improving the throughput of a compute-bound system, like an ETL pipeline written in Python, with little effort. Ideally by just downloading the new interpreter.

[+] gnufx|4 years ago|reply
I don't know what Intel did for the proprietary version, but the first thing you should do for Python is to compile with GCC's -fno-semantic-interposition. I don't know if there's a benefit from vectorization, for instance, in parts of the interpreter, or whether -Ofast helps generally if so, but I doubt there's anything Intel CPU-specific involved if there is. I've never looked at it, has the interpreter not been well-profiled and such optimizations provided? Anyway, if you want speed, don't use Python.

It's obviously not relevant to Python per se, but you get basically equivalent performance to MKL with OpenBLAS or, perhaps, BLIS, possibly with libxsmm on x86. BLIS may do better on operations other than {s,d}gemm, and/or threaded, than OpenBLAS, but they're both generally competitive.
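The single-threaded vs threaded distinction above matters for measurement. A minimal setup sketch: pin the common BLAS thread-pool knobs before NumPy loads the library, so any GEMM timing reflects one core's kernels only (the environment-variable names are the ones MKL, OpenBLAS, and BLIS conventionally honor):

```python
# Pin every common BLAS thread-pool knob to 1 *before* importing NumPy;
# once the BLAS shared library is loaded, these variables have no effect.
import os
for var in ("OMP_NUM_THREADS", "MKL_NUM_THREADS",
            "OPENBLAS_NUM_THREADS", "BLIS_NUM_THREADS"):
    os.environ[var] = "1"

import numpy as np

a = np.random.rand(256, 256)
c = a @ a  # this GEMM now runs single-threaded in MKL/OpenBLAS/BLIS
print(c.shape)
```

With that in place, a single-threaded MKL vs OpenBLAS vs BLIS comparison is at least apples to apples.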

[+] black_puppydog|4 years ago|reply
So I see Intel and Microsoft both like naming things the Wrong(TM) way around? This name makes about as much sense as WSL... :D
[+] hallgrim|4 years ago|reply
We tried using intel python in one of my previous data science jobs, and ultimately gave up because compatibility with some packages from pip was a nightmare. Alas I can’t quite remember exactly what went wrong.
[+] agloeregrets|4 years ago|reply
I wonder who the person is who saw python and was like "You know what this needs? INTEL."