This is the first time I'm hearing about HighLoad. It seems really interesting at first glance. I personally find SIMD and ISA/μarch-specific optimizations more rewarding than pure algorithmic challenges (Codeforces and the like).
Realistically, there hasn't been a ton of advancement in SIMD beyond Haswell. Haswell introduced AVX2, which is what this blog post uses. AVX-512 is certainly more powerful, but it isn't even available in the majority of Intel CPUs, even brand-new ones.
To be clear, it's not dereferencing unmapped memory; I just haven't shown how the memory is being mapped, because it's a little complex. As I note in the post, you can imagine that I mmap all the necessary addresses at the start of the program.
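For the curious, the "mmap everything up front" setup could be sketched like this in Go. The hint address and the raw-syscall approach are my own illustrative assumptions (Linux/amd64), not the post's actual mapping code:

```go
package main

import (
	"fmt"
	"syscall"
	"unsafe"
)

// reservePage maps one anonymous, writable page near the hint address.
// Passing a hint of 0 lets the kernel pick the address.
func reservePage(hint uintptr) (*byte, error) {
	const pageSize = 4096
	addr, _, errno := syscall.Syscall6(syscall.SYS_MMAP,
		hint, pageSize,
		syscall.PROT_READ|syscall.PROT_WRITE,
		syscall.MAP_PRIVATE|syscall.MAP_ANONYMOUS,
		^uintptr(0), // fd = -1 for an anonymous mapping
		0)
	if errno != 0 {
		return nil, errno
	}
	return (*byte)(unsafe.Pointer(addr)), nil
}

func main() {
	p, err := reservePage(0x10_0000_0000) // hypothetical address hint
	if err != nil {
		panic(err)
	}
	*p = 42 // dereferencing inside the mapping is now well-defined
	fmt.Println(*p)
}
```

Once the pages exist, loads and stores at those addresses are ordinary memory accesses with no per-access checks.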
ashleyn|1 year ago
Thinking parallel gets you enormous speed benefits for any number of arbitrary algorithms: https://mcyoung.xyz/2023/11/27/simd-base64/
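You don't even need intrinsics to start thinking in lanes. Here's a hand-rolled SWAR ("SIMD within a register") sketch of the same mindset, not code from the linked article: counting occurrences of a byte eight lanes at a time in plain Go.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"math/bits"
)

// countByte counts occurrences of target in buf eight bytes at a time.
func countByte(buf []byte, target byte) int {
	const lo7 = 0x7f7f7f7f7f7f7f7f
	bcast := uint64(target) * 0x0101010101010101 // target in every lane
	n, i := 0, 0
	for ; i+8 <= len(buf); i += 8 {
		x := binary.LittleEndian.Uint64(buf[i:]) ^ bcast // matching lanes -> 0
		// Exact zero-lane detector: bit 7 of each lane ends up set
		// iff that lane is zero (no carries leak between lanes).
		mask := ^(((x & lo7) + lo7) | x | lo7)
		n += bits.OnesCount64(mask)
	}
	for ; i < len(buf); i++ { // scalar tail
		if buf[i] == target {
			n++
		}
	}
	return n
}

func main() {
	fmt.Println(countByte([]byte("a b c d e f"), ' ')) // 5
}
```

The vectorized base64 post applies the same per-lane reasoning, just with 16/32-byte registers and shuffles instead of a single uint64.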
neonsunset|1 year ago
I wouldn't hold my breath though - proper support for a high-level portable SIMD abstraction requires quite a lot of compiler complexity, both because the API surface of SIMD extensions is so wide (heh) in most ISAs, and because of the details you need to get right to keep data in the appropriate (vector and/or mask) registers. This, naturally, runs in the complete opposite direction to the design philosophy of Go's compiler. Instead, you are supposed to write Go's custom ASM syntax, with byte literals used to encode opcodes the assembler doesn't support natively (which is common).
If you're interested in what high-effort SIMD implementation in this kind of language looks like, take a look at C#'s cross-platform Vector API: https://github.com/dotnet/runtime/blob/main/docs/coding-guid...
https://lemire.me/blog/2024/07/05/scan-html-faster-with-simd... (uses platform intrinsics, but showcases that you can go one abstraction level lower, retaining the same Vector128<T> type, if you need to specialize a particular part of your algorithm for a platform, without having to maintain a separate copy for each one)
Here's a high-effort vectorized CRC64 implementation that uses these: https://github.com/dotnet/runtime/blob/283de5b5adf08c42d4945... (it performs as fast as the C++ variant written with mnemonics)
dist1ll|1 year ago
Though Haswell seems like a pretty obsolete platform to optimize for at this point. Even Skylake will be a decade old next year.
sgerenser|1 year ago
xipix|1 year ago
wolf550e|1 year ago
sYnfo|1 year ago
raldi|1 year ago
sYnfo|1 year ago
2) chunks with too many small numbers cannot be processed with just 2 shuffle-adds
3) (not mentioned in the post) HighLoad limits the size of the source code you can submit, so you can't put all possible values in the look-up table
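For anyone wondering what the shuffle-add digit folding looks like, here's a scalar SWAR cousin of the idea (popularized on Lemire's blog): each multiply-add merges adjacent lanes of digits. This is my own sketch, not the contest code, and it assumes the input is exactly eight valid ASCII digits:

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// parse8Digits folds eight ASCII digits into an integer with three
// multiply-add steps, each merging adjacent lanes.
func parse8Digits(s []byte) uint32 {
	v := binary.LittleEndian.Uint64(s) - 0x3030303030303030 // ASCII -> 0..9 per byte
	v = ((v * (1 + (10 << 8))) >> 8) & 0x00FF00FF00FF00FF   // digit pairs -> 0..99
	v = ((v * (1 + (100 << 16))) >> 16) & 0x0000FFFF0000FFFF // quads -> 0..9999
	return uint32((v * (1 + (10000 << 32))) >> 32)           // all eight digits
}

func main() {
	fmt.Println(parse8Digits([]byte("12345678"))) // 12345678
}
```

The vector version does the same folding across a whole register, but first needs a shuffle to align variable-length numbers into fixed lanes, which is exactly where chunks with too many small numbers become a problem.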
genter|1 year ago
That's terrifying