top | item 46675102

(no title)

chillitom | 1 month ago

Also trying to let the compilers know that the float* are aligned would be a good move.

auto aligned_p = std::assume_aligned<16>(p)

discuss

order

magicalhippo|1 month ago

> let the compilers know that the float* are aligned

Reminded me of way back before OpenGL 2.0, and I was trying to get Vertex Buffer Objects working in my Delphi program using my NVIDIA graphics card. However it kept crashing occasionally, and I just couldn't figure out why.

I've forgotten a lot of the details, but either the exception message didn't make sense or I didn't understand it.

Anyway, after bashing my head for a while I had an epiphany of sorts. NVIDIA liked speed, vertices had to be manipulated before uploading to the GPU, maybe the driver used aligned SIMD instructions and relied on the default alignment of the C memory allocator?

In Delphi the default memory allocator at the time only did 4 byte aligned allocations, and so I searched and found that Microsoft's malloc indeed was default aligned to 16 bytes. However the OpenGL standard and VBO extension didn't say anything about alignment...

Manually aligned the buffers and voila, the crashes stopped. Good times.

Remnant44|1 month ago

which honestly, shouldn't be neccessary today with avx512. There's essentially no reason to prefer the aligned load/store commands over the unaligned ones - if the actual pointer is unaligned it will function correctly at half the throughput, while if it_is_ aligned you will get the same performance as the aligned-only load.

No reason for the compiler to balk at vectorizing unaligned data these days.

dmpk2k|1 month ago

> There's essentially no reason to prefer the aligned load/store commands over the unaligned ones - if the actual pointer is unaligned it will function correctly at half the throughput

Getting a fault instead of half the performance is actually a really good reason to prefer aligned load/store. To be fair, you're talking about a compiler here, but I never understood why people use the unaligned intrinsics...

Sesse__|1 month ago

Even with AVX512, memory arguments used in most instructions (those that are not explicitly unaligned loads) need to be aligned, no? E.g., for vaddps zmm0, zmm0, [rdi] (saving a register and an instruction over vmovups + vaddps reg, reg, reg), rdi must be suitably aligned.

Apart from that, there indeed hasn't been a real unaligned (non-atomic) penalty on Intel since Nehalem or something. Although there definitely is an extra cost for crossing a page, and I would assume also a smaller one for crossing a cache line—which is quite relevant when your ops are the same size as one!