top | item 40955306

(no title)

sYnfo | 1 year ago

1) if you set BATCH_SIZE > 14 sums_acc may overflow

2) chunks with too many small numbers cannot be processed with just 2 shuffle-adds

3) (not mentioned in the post) HighLoad limits the size of the source code you can submit, so you can't put all possible values in the look-up table

discuss

order

camel-cdr|1 year ago

Couldn't you organize the accumulators in 8 byte chunks, and leave the upper byte unused. Then you map consecutive digits to those chunks and use 64 bit addition for the accumulation. Then overflow between the bytes would keep the correct result if you do the shuffles correctly, and you have a full byte of overflow buffer.

Dwedit|1 year ago

Gaps in the numbers are often enough to do some kind of "SIMD" even on ordinary 32-bit processors.

rurban|1 year ago

So SIMD would need to set the overflow flag also to catch em.

Which would be much faster than the checked add (adc). Does any hardware support such checked SIMD arithmetic already?

Or can you still assume that most arithmetic is still broken in most languages/libraries.

adrian_b|1 year ago

AVX-512 has evolved from the Larrabee New Instructions (2009), passing through Knights Ferry (2010), Knights Corner (2012) and Knights Landing (2016), to reach Skylake Server (2017), whose set of AVX-512 instructions has remained a subset of the instruction sets of all later CPUs with AVX-512 support.

At each step from Larrabee to Skylake Server, some instructions have been lost, because the initial set of instructions was more complete in order to enable the writing of efficient GPU algorithms, while later Intel believed that for a general-purpose CPU they can reduce the costs by omitting some of those instructions.

(Nevertheless, later they have added many other instructions, some of which may be less useful and more expensive than the original instructions that have been removed.)

Among the original Larrabee instructions that have been deleted, was addition with unsigned overflow (a.k.a. carry), where the output overflow flags were stored in a mask register, enabling their use in a later conditional SIMD instruction.

Signed overflow can be implemented in hardware with negligible additional complexity (a single gate per each result number), so it would have been easy to also add to Larrabee/AVX-512 an addition instruction with signed overflow flags stored in a mask register. Even when only unsigned overflow is available, it is possible to preprocess the operands in such a way that detecting signed overflow would be possible with the unsigned overflow bits, though that requires multiple instructions, slowing down a lot the algorithm.

However in this problem the numbers that are added are non-negative, so the addition with unsigned overflow of the original Larrabee ISA would have been sufficient, had Intel not removed it from AVX-512.

kardos|1 year ago

For 1, can you raise that to 28 with unsigned accumulators?

sYnfo|1 year ago

14 already assumes unsigned accumulator! 255 [accumulator capacity] / (2 [shuffle-adds] * 9 [highest digit value]) ~= 14