sm_1024 | 1 year ago
I remember being told (and it might be wrong) that ARM can decode multiple instructions in parallel because the CPU knows where the next instruction starts, but for x86, you'd have to decode the instructions in order.
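That intuition can be sketched with a toy model (not real hardware; the one-byte "length prefix" encoding below is invented for illustration). With fixed-length instructions, every boundary is known up front and all decoders can start at once; with variable-length instructions, each boundary depends on having decoded the previous instruction:

```python
# Toy sketch: fixed-length vs variable-length instruction boundary finding.
# The "ISA" here is invented: the first byte of each variable-length
# instruction encodes its own length.

def fixed_boundaries(code_bytes):
    """Fixed 4-byte instructions: every boundary is known immediately,
    so all instructions could be handed to decoders in parallel."""
    return list(range(0, len(code_bytes), 4))

def variable_boundaries(code_bytes, length_of):
    """Variable-length instructions: each boundary is only known after
    decoding the previous instruction's length -- an inherently serial chain."""
    offsets, pc = [], 0
    while pc < len(code_bytes):
        offsets.append(pc)
        pc += length_of(code_bytes[pc])  # must decode to learn the length
    return offsets

code = bytes([1, 3, 0, 0, 2, 0, 4, 0, 0, 0])
print(fixed_boundaries(bytes(12)))            # [0, 4, 8] -- all known at once
print(variable_boundaries(code, lambda b: b)) # [0, 1, 4, 6] -- found one at a time
```

Real x86 front ends work around the serial chain by speculatively decoding at many byte offsets and discarding wrong guesses, which is part of the hardware cost discussed below.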
pohuing|1 year ago
https://chipsandcheese.com/2021/07/13/arm-or-x86-isa-doesnt-...
dzaima|1 year ago
hajile|1 year ago
1. His claim that "ARM decoder is complex too" was wrong at the time (M1 being an obvious example) and has been proven more wrong since publication. ARM dropped the uop cache as soon as they dropped support for their very CISC-y 32-bit catastrophe. They bragged that this coincided with a whopping 75% reduction in decoder size for the A715 (while INCREASING from 4 decoders to 5), and this was almost single-handedly responsible for that chip's reduced power consumption (all the other changes were comparatively minor). NONE of the current-gen cores from ARM, Apple, or Qualcomm use a uop cache, eliminating those power-hungry caches and cache controllers.
2. The paper[0] he quotes has a stupid conclusion. It shows integer workloads spending a massive 22% of total core power on the decoder, and even their fake float workload showed 8% of total core power. Realize that a study[1] of the entire Ubuntu package repo showed that just 12 int/ALU instructions made up 89% of all code, with float/SIMD in the very low single digits of use.
3. The x86 decoder situation has gotten worse. Because adding extra parallel decoders is exponentially complex, they decided to spend massive amounts of transistors on multiple decoder blocks working on different speculated branches. Setting aside that this penalizes unrolled code (where they may have just 3-4 decoders active while modern ARM designs have 10+), the setup for this is incredibly complex and man-year intensive.
4. "ARM decodes into uops too" is a false equivalency. The uops used by ARM are extremely close to the original instructions, as shown by how easily they eliminated the uop cache. x86 has a much harder job here, cracking a comparatively small set of complex instructions into a much larger set of uop sequences.
5. "ARM is bloated too". ARM redid their entire ISA to eliminate bloat. If the ISA didn't actually matter, why would they do this?
6. "RISC-V will become bloated too" is an appeal to ignorance. x86 has SEVENTEEN major SIMD extensions (excluding the dozen or so AVX-512 extensions), all with various incompatibilities and issues, because nobody knew what SIMD should look like. We know now, and RISC-V won't be making that mistake. x86 has useless stuff like BCD instructions using up precious short-encoding space because they didn't know better. RISC-V won't do this either. With 50+ years of figuring out the basics, RISC-V won't be making any major mistakes on the most important stuff.
7. Omitting complexity. A bloated, ancient codebase takes forever to do anything with. A bloated, ancient ISA takes forever to do anything with. If ARM and Intel both put X dollars into a new CPU design, Intel is going to spend 20-30% or maybe even more of their budget on devs chasing edge cases and testers testing all those edge cases. Meanwhile, ARM gets to spend that 20-30% of their budget on increasing performance. All other things being equal, the ARM chip will be better at any given design price point.
8. Compilers matter. Emitting fast x86 code is incredibly hard because there are so many ways to do the same thing, each with its own tradeoffs (which interact in weird ways with the tradeoffs of nearby instructions). We settle for heuristic peephole optimizations because provably optimal code generation would take centuries. RISC-V and ARM both make it far easier for compiler writers because there's usually just one option rather than many, and that one option is going to be fast.
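As a concrete illustration of point 8: even just zeroing a register on x86 has several idioms with different size and flag tradeoffs, so compilers pick via heuristics. The idioms and byte counts below are real x86 (32-bit register forms); the selection heuristic is a toy:

```python
# Toy peephole-style choice among real x86 zeroing idioms.
# Byte counts are for the 32-bit register encodings.

ZERO_IDIOMS = {
    "xor eax, eax": 2,  # 2 bytes, clobbers flags, recognized as a zeroing idiom
    "sub eax, eax": 2,  # 2 bytes, clobbers flags
    "mov eax, 0":   5,  # 5 bytes, leaves flags untouched
}

def pick_zeroing(flags_live):
    """Toy heuristic: prefer the short xor unless flags must be preserved."""
    return "mov eax, 0" if flags_live else "xor eax, eax"

print(pick_zeroing(flags_live=False))  # xor eax, eax
print(pick_zeroing(flags_live=True))   # mov eax, 0
```

On a load/store RISC ISA there is typically one obvious way to do this, so the compiler has nothing to weigh.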
[0] https://www.usenix.org/system/files/conference/cooldc16/cool...
[1] https://oscarlab.github.io/papers/instrpop-systor19.pdf
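The asymmetry in point 4 can be sketched as follows. The mnemonics are real assembly syntax, but the uop names and cracking rules are invented for illustration: an x86 read-modify-write instruction cracks into a load, an ALU op, and a store, while a typical AArch64 instruction (being load/store, with no memory-operand ALU ops) maps roughly 1:1:

```python
# Hypothetical uop cracking, illustrating point 4. The uop names and
# the cracking rules here are made up.

def crack_x86(insn):
    """x86 read-modify-write: one instruction becomes load + ALU + store."""
    if insn == "add [rbx], rax":
        return ["uop_load tmp, [rbx]", "uop_add tmp, rax", "uop_store [rbx], tmp"]
    return [insn]  # simple register-only ops map 1:1

def crack_arm(insn):
    """AArch64 ALU ops never touch memory, so the mapping is close to 1:1."""
    return [insn]

print(crack_x86("add [rbx], rax"))  # three uops
print(crack_arm("add x0, x0, x1"))  # one uop
```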
anvuong|1 year ago