(no title)
sYnfo | 1 year ago
2) chunks with too many small numbers cannot be processed with just 2 shuffle-adds
3) (not mentioned in the post) HighLoad limits the size of the source code you can submit, so you can't put all possible values in the look-up table
camel-cdr|1 year ago
Dwedit|1 year ago
rurban|1 year ago
Which would be much faster than the checked add (adc). Does any hardware support such checked SIMD arithmetic already?
Or can you still assume that most arithmetic is still broken in most languages/libraries.
adrian_b|1 year ago
At each step from Larrabee to Skylake Server, some instructions have been lost, because the initial set of instructions was more complete in order to enable the writing of efficient GPU algorithms, while later Intel believed that for a general-purpose CPU they can reduce the costs by omitting some of those instructions.
(Nevertheless, later they have added many other instructions, some of which may be less useful and more expensive than the original instructions that have been removed.)
Among the original Larrabee instructions that have been deleted, was addition with unsigned overflow (a.k.a. carry), where the output overflow flags were stored in a mask register, enabling their use in a later conditional SIMD instruction.
Signed overflow can be implemented in hardware with negligible additional complexity (a single gate per each result number), so it would have been easy to also add to Larrabee/AVX-512 an addition instruction with signed overflow flags stored in a mask register. Even when only unsigned overflow is available, it is possible to preprocess the operands in such a way that detecting signed overflow would be possible with the unsigned overflow bits, though that requires multiple instructions, slowing down a lot the algorithm.
However in this problem the numbers that are added are non-negative, so the addition with unsigned overflow of the original Larrabee ISA would have been sufficient, had Intel not removed it from AVX-512.
kardos|1 year ago
sYnfo|1 year ago