sm_1024 | 1 year ago
I remember being told (and it might be wrong) that ARM can decode multiple instructions in parallel because the CPU knows where the next instruction starts, but for x86, you'd have to decode the instructions in order.
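That intuition can be sketched with a toy model (not real hardware; the one-byte "length prefix" encoding below is invented for illustration). With fixed-length instructions, every boundary is known up front and all decoders can start at once; with variable-length instructions, each boundary depends on having decoded the previous instruction:

```python
# Toy sketch: fixed-length vs variable-length instruction boundary finding.
# The "ISA" here is invented: the first byte of each variable-length
# instruction encodes its own length.

def fixed_boundaries(code_bytes):
    """Fixed 4-byte instructions: every boundary is known immediately,
    so all instructions could be handed to decoders in parallel."""
    return list(range(0, len(code_bytes), 4))

def variable_boundaries(code_bytes, length_of):
    """Variable-length instructions: each boundary is only known after
    decoding the previous instruction's length -- an inherently serial chain."""
    offsets, pc = [], 0
    while pc < len(code_bytes):
        offsets.append(pc)
        pc += length_of(code_bytes[pc])  # must decode to learn the length
    return offsets

code = bytes([1, 3, 0, 0, 2, 0, 4, 0, 0, 0])
print(fixed_boundaries(bytes(12)))            # [0, 4, 8] -- all known at once
print(variable_boundaries(code, lambda b: b)) # [0, 1, 4, 6] -- found one at a time
```

Real x86 front ends work around the serial chain by speculatively decoding at many byte offsets and discarding wrong guesses, which is part of the hardware cost discussed below.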
pohuing|1 year ago
https://chipsandcheese.com/2021/07/13/arm-or-x86-isa-doesnt-...
dzaima|1 year ago
hajile|1 year ago
1. His claim that "ARM decoder is complex too" was wrong at the time (M1 being an obvious example) and has been proven more wrong since publication. ARM dropped the uop cache as soon as they dropped support for their very CISC-y 32-bit catastrophe. They bragged that this coincided with a whopping 75% reduction in decoder size for the A715 (while INCREASING from 4 decoders to 5), and this was almost single-handedly responsible for that chip's reduced power consumption (all the other changes were comparatively minor). NONE of the current-gen cores from ARM, Apple, or Qualcomm use a uop cache, eliminating those power-hungry caches and cache controllers.
2. The paper[0] he quotes has a stupid conclusion. It shows integer workloads spending a massive 22% of total core power on the decoder, and even their fake float workload showed 8% of total core power. Realize that a study[1] of the entire Ubuntu package repo showed that just 12 int/ALU instructions made up 89% of all code, with float/SIMD in the very low single digits of use.
3. The x86 decoder situation has gotten worse. Because adding extra parallel decoders is exponentially complex, they decided to spend massive amounts of transistors on multiple decoder blocks working on different speculated branches. Setting aside that this penalizes unrolled code (where they may have just 3-4 decoders active while modern ARM designs have 10+), the setup for this is incredibly complex and man-year intensive.
4. "ARM decodes into uops too" is a false equivalency. The uops used by ARM are extremely close to the original instructions, as shown by how easily they eliminated the uop cache. x86 has a much harder job here, cracking a comparatively small set of complex instructions into a much larger set of uop sequences.
5. "ARM is bloated too". ARM redid their entire ISA to eliminate bloat. If the ISA didn't actually matter, why would they do this?
6. "RISC-V will become bloated too" is an appeal to ignorance. x86 has SEVENTEEN major SIMD extensions (excluding the dozen or so AVX-512 extensions), all with various incompatibilities and issues, because nobody knew what SIMD should look like. We know now, and RISC-V won't be making that mistake. x86 has useless stuff like BCD instructions using up precious short-encoding space because they didn't know better. RISC-V won't do this either. With 50+ years of figuring out the basics, RISC-V won't be making any major mistakes on the most important stuff.
7. Omitting complexity. A bloated, ancient codebase takes forever to do anything with. A bloated, ancient ISA takes forever to do anything with. If ARM and Intel both put X dollars into a new CPU design, Intel is going to spend 20-30% or maybe even more of their budget on devs chasing edge cases and testers testing all those edge cases. Meanwhile, ARM gets to spend that 20-30% of their budget on increasing performance. All other things being equal, the ARM chip will be better at any given design price point.
8. Compilers matter. Emitting fast x86 code is incredibly hard because there are so many ways to do the same thing, each with its own tradeoffs (which interact in weird ways with the tradeoffs of nearby instructions). We settle for heuristic peephole optimizations because provably optimal code generation would take centuries. RISC-V and ARM both make it far easier for compiler writers because there's usually just one option rather than many, and that one option is going to be fast.
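As a concrete illustration of point 8: even just zeroing a register on x86 has several idioms with different size and flag tradeoffs, so compilers pick via heuristics. The idioms and byte counts below are real x86 (32-bit register forms); the selection heuristic is a toy:

```python
# Toy peephole-style choice among real x86 zeroing idioms.
# Byte counts are for the 32-bit register encodings.

ZERO_IDIOMS = {
    "xor eax, eax": 2,  # 2 bytes, clobbers flags, recognized as a zeroing idiom
    "sub eax, eax": 2,  # 2 bytes, clobbers flags
    "mov eax, 0":   5,  # 5 bytes, leaves flags untouched
}

def pick_zeroing(flags_live):
    """Toy heuristic: prefer the short xor unless flags must be preserved."""
    return "mov eax, 0" if flags_live else "xor eax, eax"

print(pick_zeroing(flags_live=False))  # xor eax, eax
print(pick_zeroing(flags_live=True))   # mov eax, 0
```

On a load/store RISC ISA there is typically one obvious way to do this, so the compiler has nothing to weigh.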
[0] https://www.usenix.org/system/files/conference/cooldc16/cool...
[1] https://oscarlab.github.io/papers/instrpop-systor19.pdf
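The asymmetry in point 4 can be sketched as follows. The mnemonics are real assembly syntax, but the uop names and cracking rules are invented for illustration: an x86 read-modify-write instruction cracks into a load, an ALU op, and a store, while a typical AArch64 instruction (being load/store, with no memory-operand ALU ops) maps roughly 1:1:

```python
# Hypothetical uop cracking, illustrating point 4. The uop names and
# the cracking rules here are made up.

def crack_x86(insn):
    """x86 read-modify-write: one instruction becomes load + ALU + store."""
    if insn == "add [rbx], rax":
        return ["uop_load tmp, [rbx]", "uop_add tmp, rax", "uop_store [rbx], tmp"]
    return [insn]  # simple register-only ops map 1:1

def crack_arm(insn):
    """AArch64 ALU ops never touch memory, so the mapping is close to 1:1."""
    return [insn]

print(crack_x86("add [rbx], rax"))  # three uops
print(crack_arm("add x0, x0, x1"))  # one uop
```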
anvuong|1 year ago