Funny: "It is not clear that any compiler will ever use this instruction — it looks like it is designed for Kazushige Goto's personal use." https://en.wikipedia.org/wiki/Kazushige_Goto
This sounds like exactly the instruction needed for the inner loop of xcorr_kernel(), the function at the heart of a bunch of algorithms used in the Opus codec. This falls under the "convolution kernel" use-case described in the article.
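A scalar sketch of that inner-loop shape (illustrative only: the function name and the 4-lag layout are my simplification, not the actual Opus xcorr_kernel):

```c
/* Correlate a signal x against a short window y, producing 4 neighboring
 * lags per call -- the multiply-accumulate pattern a 4-lag FMA instruction
 * would target. Hypothetical simplification of an xcorr-style kernel. */
static void xcorr4(const float *x, const float *y, float sum[4], int len)
{
    for (int i = 0; i < len; i++) {
        sum[0] += x[i]     * y[i];
        sum[1] += x[i + 1] * y[i];
        sum[2] += x[i + 2] * y[i];
        sum[3] += x[i + 3] * y[i];
    }
}
```

Each iteration reuses the same load of y[i] across four output lags, which is exactly the operand reuse the instruction exploits.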
This instruction is surely targeted at deep learning applications. Convolutional layers take up the majority of the compute time of deep networks.
People seem optimistic that compilers will auto generate such instructions, but even if a compiler could generate the instruction, you would need to carefully organize your data structures to take advantage of it.
Pipelining is as important as SIMD in achieving peak flops on current processors. You can do 16 flops in a single fmadd instruction, in five cycles. But ten consecutive independent fmadds also take about five cycles, while performing 160 flops. Getting a pipeline going like that requires very careful design of data structures by the programmer.
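A sketch of the accumulator trick this implies (the 4-way split and the 5-cycle figure are illustrative assumptions, not tied to a particular core):

```c
/* Hiding FMA latency with independent accumulators. A single running sum
 * makes every fmadd wait ~5 cycles on the previous one; four independent
 * sums give the pipeline separate dependency chains to overlap. */
static float dot(const float *a, const float *b, int n)
{
    float s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    int i;
    for (i = 0; i + 4 <= n; i += 4) {
        s0 += a[i]     * b[i];      /* these four chains are  */
        s1 += a[i + 1] * b[i + 1];  /* independent, so the    */
        s2 += a[i + 2] * b[i + 2];  /* FMA unit can overlap   */
        s3 += a[i + 3] * b[i + 3];  /* their latencies        */
    }
    for (; i < n; i++) s0 += a[i] * b[i];  /* scalar tail */
    return (s0 + s1) + (s2 + s3);
}
```

Note the final fold changes the summation order, so this is only valid where the usual fast-math reassociation is acceptable.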
Current publicly announced AVX512 does not support fp16. Skylake Server (SKX) and Knights Landing (KNL) are at a disadvantage here. They've not publicly said anything about extensions in Knights Hill (the long announced successor to KNL).
That said, Intel have announced the emergency "Knights Mill" processor jammed into the roadmap between KNL and Knights Hill. It's specifically targeted at deep learning workloads, and one might expect FP16 support. They had a bullet point suggesting 'variable' precision too; I would guess that means Williamson-style variable fixed point. (I also guess that the Nervana "Flexpoint" is a trademarked variant of it.)
I assume the FPGA inference card supports fp16. And Lake Crest (the first Nervana chip, sampling next year) will support Flexpoint, of course. I would expect subsequent Xeon / Lake Crest successor integrations to do the same.
Fun times.
Aside on the compiler work -- I think it's not that hard to emit this instruction at least for GEMM style kernels where it's relatively obvious.
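For what it's worth, the GEMM pattern in question is just the classic triple loop; a minimal sketch (illustrative only, not a real BLAS kernel, which would block for cache and registers):

```c
/* Minimal GEMM-style kernel, C += A*B, row-major. The loop structure a
 * compiler pattern-matches when deciding it can emit wide FMA sequences:
 * with k in the middle, the innermost j loop makes unit-stride,
 * vectorizable accesses on the rows of B and C. */
static void gemm(const float *A, const float *B, float *C,
                 int M, int N, int K)
{
    for (int i = 0; i < M; i++)
        for (int k = 0; k < K; k++)
            for (int j = 0; j < N; j++)
                C[i * N + j] += A[i * K + k] * B[k * N + j];
}
```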
Man, when I see this stuff I sure hope there is maturation of auto-vectorization at the compiler level in clang etc.
Even more useful would be compiler-level feedback of how to stay within the constraints needed to auto-vectorize your C/C++ for loop. (I need to make this data access const etc)
As far as I know, the Intel compiler is ahead of MSVC/clang on this front without resorting to OpenMP or other annotations on your code.
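As a concrete example of one such constraint, aliasing is the usual blocker; a hypothetical sketch (the function is made up for illustration):

```c
/* Without 'restrict' the compiler must assume dst and src might overlap
 * and may refuse to vectorize this loop; with the qualifiers it becomes a
 * straightforward SIMD candidate. This is the kind of hint a feedback
 * report would tell you to add. */
static void saxpy(float *restrict dst, const float *restrict src,
                  float a, int n)
{
    for (int i = 0; i < n; i++)
        dst[i] += a * src[i];
}
```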
> Even more useful would be compiler-level feedback of how to stay within the constraints needed to auto-vectorize your C/C++ for loop. (I need to make this data access const etc)
Specifically for clang you can use the new opt-viewer [1] to annotate the source to find hints on what could/couldn't be optimized/vectorized. Requires compiler flags only found in 4.x/trunk, unfortunately.
Intel is abandoning their compiler infrastructure and moving everything to Clang/LLVM. This does mean pushing their autovectorization work into LLVM, although judging from the quality of conversation in the vectorization BoF at the latest developers' meeting, it's not clear how much work they are willing to put into actually making acceptable upstreamable patches.
The changes necessary to make efficient use of these instructions go well beyond inner loops (e.g. struct-of-arrays vs. array-of-structs layouts), and they are changes the compiler can't transparently perform for you.
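A minimal sketch of what that layout change looks like (hypothetical types; the point is the memory stride, not the API):

```c
/* Array-of-structs vs struct-of-arrays: the transformation the compiler
 * can't do for you. The AoS loop strides over interleaved fields; the SoA
 * loop touches one contiguous array, which is what vector loads want. */
struct point_aos { float x, y, z; };      /* memory: x0 y0 z0 x1 y1 z1 ... */
struct points_soa { float *x, *y, *z; };  /* memory: x0 x1 ... | y0 y1 ... */

static void scale_aos(struct point_aos *p, float s, int n)
{
    for (int i = 0; i < n; i++) p[i].x *= s;   /* stride 12 bytes */
}

static void scale_soa(struct points_soa *p, float s, int n)
{
    for (int i = 0; i < n; i++) p->x[i] *= s;  /* stride 4 bytes: vectorizable */
}
```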
MSVC provides that feedback. See https://msdn.microsoft.com/en-us/library/jj658585.aspx - the compiler option you want is /Qvec-report:2 which "Outputs an informational message for loops that are vectorized and for loops that are not vectorized, together with a reason code."
For effective use of such vector instructions, languages like C/C++ really fail at giving enough hints to the compiler. An advanced inspection tool in clang/gcc would certainly help humans write compiler-friendly code, but the real advance can only come from an improved programming language designed specifically for such use. Perhaps the HN crowd is more knowledgeable than me, but I fail to recognize any potentially useful language on the market to date. Perhaps Haskell or OCaml with compilers helped by advanced AI?
After 3 years of Xeon Phi I'm still waiting for OpenCL on Fortran with vector support, so we can finally have a sane and performant programming model for these things. Instructions are neat, but the tooling support is just not there for widespread use, IMO. If Intel had taken a more long-term, OpenMP-oriented strategy years ago, i.e. embracing accelerators of all kinds instead of holding on tight to it for market protection, I think they'd be in a better position now.
Great. Will Intel still disable those features on lower-end chips, thus ensuring that the market share for such features to make sense for developers won't be reached anytime soon after release?
This is a KNL-only feature to work around a microarchitectural limitation (a maximum of two instructions issued per clock) and speed up a few specific benchmarks^Wworkloads.
Historically, vector processors had vector registers but scalar ALU execution units (though possibly more than one). Vector instructions were "just" a way to make sure the ALU was fed a new operation every cycle without instruction fetch and loop overhead. They also made it easier to pipeline reads from main memory (memory latency wasn't so high at that time, so large vector operations made it possible to overlap reads with processing without stalling the CPU). None of those issues has been a bottleneck for a while, and the memory subsystem of a modern computer is significantly different, so classic vector processors have fallen out of favour.
In contrast, more modern SIMD machines normally have vector execution units as wide as the registers themselves. The advantages, in addition to one N-wide vector ALU being more power- and area-efficient than N scalar ones, are that an OoO machine has fewer in-flight instructions to track, and that it is easier to take advantage of wider memory/cache buses.
Because classic vector machines processed elements one at a time, it was possible to have efficient accumulating operations, which are significantly harder on proper SIMD processors (the so-called horizontal operations).
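A sketch of why horizontal accumulation is awkward: after a vectorized loop you are left with one partial sum per lane, and folding them needs a log-depth shuffle/add sequence (a plain array stands in for an 8-lane register here):

```c
/* Horizontal sum of 8 "lanes", mimicking the SIMD reduction sequence:
 * log2(8) = 3 rounds of adding the upper half onto the lower half, just
 * like the shuffle+add pairs a compiler emits after a vectorized loop. */
#define LANES 8
static float hsum(const float lane[LANES])
{
    float tmp[LANES];
    for (int i = 0; i < LANES; i++) tmp[i] = lane[i];
    for (int w = LANES / 2; w > 0; w /= 2)  /* widths 4, 2, 1 */
        for (int i = 0; i < w; i++)
            tmp[i] += tmp[i + w];
    return tmp[0];
}
```

On a classic one-element-at-a-time vector machine the accumulation simply rides along in a scalar register, with no such epilogue.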
In the more mathematical sense, the "vectors" here are the matrices. Vector instructions are usually implemented as fixed-size SIMD operations that do some chunk-sized work on a problem; these new instructions would seemingly operate on an entire vector/matrix.
Does anyone know if AVX512 will support fp16?
[1] https://github.com/llvm-mirror/llvm/tree/master/utils/opt-vi...
Linear-time Matrix Transpose Algorithms Using Vector Register File With Diagonal Registers
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.13....
disclaimer: I'm the author.
I thought "vector" in the context of CPU instructions just meant more than one.
Is there a definition where vector implies consecutive?