top | item 45583017

nullbyte | 4 months ago

This code looks like an alien language to me. Or maybe I'm just rusty at C.

SIMD intrinsics are less C and more assembly with overlong mnemonics and a register allocator, so even reading them is something of a separate skill. Unlike the skill of achieving meaningful speedups by writing them (i.e. low-level optimization), it’s nothing special, but expect to spend a lot of time jumping between the code and the reference manuals[1,2] at first.

[1] https://www.intel.com/content/www/us/en/docs/intrinsics-guid...

[2] https://developer.arm.com/architectures/instruction-sets/int...

ashtonsix|4 months ago

The weirdness probably comes from heavy use of "SIMD intrinsics" (Googleable term). These are functions with a 1:1 correspondence to assembly instructions, used for processing multiple values per instruction.
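
To make that concrete, here is a minimal sketch (not taken from the code under discussion) of one four-lane 32-bit add written with NEON intrinsics, with SSE2 intrinsics, and as a plain loop. The function name `add4_u32` is made up for illustration; the intrinsics themselves are the standard ones from `arm_neon.h` and `emmintrin.h`.

```c
#include <assert.h>
#include <stdint.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#elif defined(__SSE2__)
#include <emmintrin.h>
#endif

/* out[i] = a[i] + b[i] for i in 0..3 (hypothetical helper, wraps modulo 2^32). */
static void add4_u32(const uint32_t *a, const uint32_t *b, uint32_t *out)
{
    /* Callers must pass arrays of at least four elements. */
    assert(a && b && out);
#if defined(__ARM_NEON)
    /* Two loads, one vector ADD, one store: each intrinsic maps to roughly one instruction. */
    vst1q_u32(out, vaddq_u32(vld1q_u32(a), vld1q_u32(b)));
#elif defined(__SSE2__)
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)out, _mm_add_epi32(va, vb)); /* PADDD */
#else
    /* Scalar fallback: same result, one lane at a time. */
    for (int i = 0; i < 4; i++)
        out[i] = a[i] + b[i];
#endif
}
```

The names encode the operation, the register width (`q` on NEON, `128` on x86) and the lane type, which is most of what makes the code look foreign at first.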

ack_complete|4 months ago

This is partially due to the compromises of mapping vector intrinsics into C (with C++ only being marginally better). In a more vector-oriented language, such as shader languages, this:

  s1 = vaddq_u32(s1, vextq_u32(z, s1, 2));
  s1 = vaddq_u32(s1, vdupq_laneq_u32(s0, 3));

would be more like this:

  s1.zw += s1.xy;
  s1 += s0.w;

mananaysiempre|4 months ago

To be fair, even in standard C11 you can do a bit better than the CPU manufacturer’s syntax

  /* Pick the element-wise add intrinsic from the type of A, then call it. */
  #define vadd(A, B) _Generic((A), \
      int8x8_t:    vadd_s8,   \
      uint8x8_t:   vadd_u8,   \
      int8x16_t:   vaddq_s8,  \
      uint8x16_t:  vaddq_u8,  \
      int16x4_t:   vadd_s16,  \
      uint16x4_t:  vadd_u16,  \
      int16x8_t:   vaddq_s16, \
      uint16x8_t:  vaddq_u16, \
      int32x2_t:   vadd_s32,  \
      uint32x2_t:  vadd_u32,  \
      float32x2_t: vadd_f32,  \
      int32x4_t:   vaddq_s32, \
      uint32x4_t:  vaddq_u32, \
      float32x4_t: vaddq_f32, \
      int64x2_t:   vaddq_s64, \
      uint64x2_t:  vaddq_u64, \
      float64x2_t: vaddq_f64)((A), (B))

while in GNU C you can in fact use normal arithmetic and indexing (but not swizzles) on vector types.
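
As a rough sketch of what that looks like, here is a four-lane prefix sum using the `vector_size` attribute, which both GCC and Clang accept. The type name `u32x4` and the function `prefix_sum4` are made up for illustration, and the lane shifts are spelled out with indexing precisely because there is no swizzle syntax.

```c
#include <assert.h>
#include <stdint.h>

/* GNU C vector extension (GCC and Clang); this is not standard C. */
typedef uint32_t u32x4 __attribute__((vector_size(16)));
static_assert(sizeof(u32x4) == 4 * sizeof(uint32_t), "four 32-bit lanes");

/* Inclusive prefix sum of the four lanes of v, plus the last lane of prev
 * as a carry-in from the previous block (hypothetical helper). */
static u32x4 prefix_sum4(u32x4 v, u32x4 prev)
{
    /* Shift up by one lane, filling with zero, and add. */
    u32x4 sh1 = { 0, v[0], v[1], v[2] };
    v += sh1;

    /* Shift up by two lanes and add. With a zero vector z, this has the
     * same effect as vaddq_u32(v, vextq_u32(z, v, 2)) on NEON. */
    u32x4 sh2 = { 0, 0, v[0], v[1] };
    v += sh2;

    /* Broadcast the carry lane by hand and add it to every lane. */
    u32x4 carry = { prev[3], prev[3], prev[3], prev[3] };
    return v + carry;
}
```

Calling `prefix_sum4` with `v = {1, 2, 3, 4}` and a `prev` whose last lane is 100 gives `{101, 103, 106, 110}`. Whether the compiler turns the indexed shifts into a single shuffle instruction depends on the target and optimization level, so treat this as a readability sketch rather than a performance claim.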