rcgorton | 3 years ago

Re: register windows. I disagree: code size wasn't the killer here; it was how DEEP the call stack got. If your architectural register windows spilled at 4 deep, then calls 3 deep were fine, but if you had code running a tight loop whose call chain went 8 deep, every iteration spilled and refilled windows, and you were in [performance] trouble.
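To put rough numbers on that, here is a toy model of the spill/fill behavior. It is only a sketch: `window_traps` and its parameters are made up for illustration, it assumes one window per call frame, and it ignores details of real hardware (reserved windows, trap cost, how many windows a handler moves at once).

```python
def window_traps(num_windows, call_depth, iterations):
    """Count (spills, fills) for a loop whose body calls `call_depth` deep.

    Hypothetical model: the register file holds `num_windows` frames,
    and the loop's own frame occupies one of them.
    """
    resident = 1  # frames currently held in registers
    spills = fills = 0
    for _ in range(iterations):
        for _ in range(call_depth):       # descend the call chain
            if resident == num_windows:
                spills += 1               # oldest window is written to memory
            else:
                resident += 1
        for _ in range(call_depth):       # unwind the call chain
            resident -= 1
            if resident == 0:
                fills += 1                # caller's window is reloaded from memory
                resident = 1
    return spills, fills
```

With 4 windows, `window_traps(4, 3, 1000)` returns `(0, 0)`, while `window_traps(4, 8, 1000)` returns `(5000, 5000)`: the deep chain pays 10 traps on every single iteration, because the unwind throws away exactly the windows the next descent needs.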

Another divot: asymmetric functional units. Some versions of Alpha supported a PopCount instruction, but it could only issue to a single functional unit, which made scheduling a pain, especially if you had to write in assembly language.

I'm not convinced that AVX 256 and AVX 512 are useful for non-matrix operations. Most strings (more importantly, tokens bounded by whitespace during parsing) are much shorter than 256 bits (32 bytes), let alone 512 bits (64 bytes). In English, I cannot come up with many words longer than 16 bytes (some place names, antidisestablishmentarianism, chemical compound names, and a few others).

loup-vaillant | 3 years ago

> I'm not convinced that AVX 256 and AVX 512 are useful for non-matrix operations.

I've observed that, compared to regular x86-64 code without SIMD, using AVX 256 speeds up the ChaCha20 cipher by a factor of 5 on long messages, which can be processed in 512-byte chunks (8 blocks of 64 bytes each). Network packets easily exceed 1 KB, and files are usually much bigger.

Matrix operations aren't the only viable niche.
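
For anyone wondering where the 8 blocks come from: a 256-bit register holds eight 32-bit words, so one common layout puts the same ChaCha20 state word from 8 independent blocks into one register and runs the quarter-round on all lanes at once. The sketch below models that in plain Python, with lists standing in for vector registers. The helper names are illustrative, and this is an assumption about the layout, not the actual implementation measured above.

```python
MASK = 0xFFFFFFFF
LANES = 8          # eight 32-bit lanes in a 256-bit register
BLOCK_BYTES = 64   # size of one ChaCha20 block

def add(u, v):
    """Lane-wise 32-bit addition (wraps modulo 2**32)."""
    return [(x + y) & MASK for x, y in zip(u, v)]

def xor(u, v):
    """Lane-wise XOR."""
    return [x ^ y for x, y in zip(u, v)]

def rotl(v, n):
    """Lane-wise 32-bit left rotation by n bits."""
    return [((x << n) & MASK) | (x >> (32 - n)) for x in v]

def quarter_round(a, b, c, d):
    """ChaCha20 quarter-round applied to every lane in lockstep."""
    a = add(a, b); d = rotl(xor(d, a), 16)
    c = add(c, d); b = rotl(xor(b, c), 12)
    a = add(a, b); d = rotl(xor(d, a), 8)
    c = add(c, d); b = rotl(xor(b, c), 7)
    return a, b, c, d

# Each "register" carries one state word from each of the 8 blocks.
a, b, c, d = quarter_round([0x11111111] * LANES, [0x01020304] * LANES,
                           [0x9b8d6f43] * LANES, [0x01234567] * LANES)
```

Every lane here reproduces the quarter-round test vector from RFC 8439 (a = 0xea2a92f4, b = 0xcb1cf8ce, c = 0x4581472e, d = 0x5881c4bb), and `LANES * BLOCK_BYTES` gives the 512-byte chunk. None of the operations cross lanes, which is why the cipher vectorizes so well.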