clamchowder | 1 year ago
Predecode/uop cache are both means to the same end, mitigating decode power. AMD and Intel have used both (though not on the same core). Arm has used both, including both on the same core for quite a few generations.
And a uop cache is just a cache. It's also big enough on current generations to cache more than just loops, to the point where it covers a majority of the instruction stream. Not sure where the misunderstanding of the uop cache "working as far in advance as possible" comes from. Unless you're talking about the BPU running ahead and prefetching into it? Which it does for L1i, and L2 as well?
2. "you can't run nothing but floats" they didn't do that in the paper, they did D += A[j] + B[j] ∗ C[j]. Something like matrix multiplication comes to mind, and that's not exactly a rare workload considering some ML stuff these days.
But also, has a study been done on Arm cores? For all we know they could spend similar power budgets on decode, or more. I could say an Arm core uses 99% of its power budget on decode, and be just as right as you are (they probably don't, my point is you don't have concrete data on both Arm and x86 decode power, which would be necessary for a productive discussion on the subject)
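Re: point 2, here's a minimal C sketch of the paper's kernel (the array names come straight from the formula; the loop bound n is my assumption):

    /* The paper's kernel: D += A[j] + B[j] * C[j]. With -O2 and FMA
     * enabled (-mfma on x86, on by default for aarch64) a compiler
     * turns this into fmadd-heavy code, much like a matmul inner
     * loop; vectorizing the reduction additionally needs -ffast-math. */
    double kernel(const double *A, const double *B, const double *C, int n) {
        double D = 0.0;                 /* running accumulator */
        for (int j = 0; j < n; j++)
            D += A[j] + B[j] * C[j];
        return D;
    }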
3. You're describing letting the BPU run ahead, which everyone has been doing for the past 15 years or so. Losing fetch bandwidth past a taken branch is a different thing.
4. Not sure where you're going. You started by suggesting Arm has less micro-op expansion than x86, and I provided a counterexample. Now you're talking about avoiding complex instructions, which a) compilers do on both architectures, they'll avoid stuff like division, and b) humans don't in cases where complex instructions are beneficial: see the Linux kernel using rep movsb (https://github.com/torvalds/linux/blob/5189dafa4cf950e675f02...; a sketch of the pattern is below) and Arm introducing similar complex instructions (https://community.arm.com/arm-community-blogs/b/architecture...)
Also "complex" x86 instructions aren't avoided in the video encoding workload. On x86 it takes ~16.5T instructions to finish the workload, and ~19.9T on Arm (and ~23.8T micro-ops on Neoverse V2). If "complex" means more work per instruction, then x86 used more complex instructions, right?
8. You can use a variable-length NOP on x86, or multiple NOPs on Arm, to align function calls to cacheline boundaries. What's the difference? Isn't the latter worse if you need to move by more than 4 bytes, since you'd have multiple NOPs and, you assume, multiple uops? That assumption isn't always true either, as some x86 and some Arm CPUs can fuse NOP pairs.
But seriously, do try gathering some data to see if cacheline alignment matters (a minimal experiment is sketched below). A lot of x86/Arm cores that do micro-op caching don't seem to care whether a function (or branch target) is aligned to the start of a cacheline. Golden Cove's return predictor does appear to track targets at cacheline granularity, but that's a special case. Earlier Intel and pretty much all AMD cores don't seem to care, nor do the Arm ones I've tested.
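A minimal version of that experiment, assuming GCC/Clang on Linux; build with -O2 -falign-functions=1 so only the attribute below forces alignment, then compare entry addresses and timings:

    #include <stdio.h>
    #include <stdint.h>
    #include <time.h>

    #define ITERS 200000000ULL

    __attribute__((noinline, aligned(64)))   /* entry on a cacheline boundary */
    static uint64_t hot_aligned(uint64_t x) {
        for (uint64_t i = 0; i < ITERS; i++)
            x = x * 6364136223846793005ULL + 1442695040888963407ULL;
        return x;
    }

    __attribute__((noinline))                /* entry lands wherever it lands */
    static uint64_t hot_default(uint64_t x) {
        for (uint64_t i = 0; i < ITERS; i++)
            x = x * 6364136223846793005ULL + 1442695040888963407ULL;
        return x;
    }

    static double now(void) {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + ts.tv_nsec * 1e-9;
    }

    int main(void) {
        printf("aligned entry mod 64 = %lu, default entry mod 64 = %lu\n",
               (unsigned long)((uintptr_t)hot_aligned % 64),
               (unsigned long)((uintptr_t)hot_default % 64));
        double t0 = now(); volatile uint64_t a = hot_aligned(1);
        double t1 = now(); volatile uint64_t b = hot_default(1);
        double t2 = now();
        printf("aligned %.3fs, default %.3fs\n", t1 - t0, t2 - t1);
        (void)a; (void)b;
        return 0;
    }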
Anyway, you're making a lot of unsubstantiated guesses about "weirdness" without anything to suggest it has any effect. I don't think this is the right approach. Instead of "tail wagging the dog" or whatever, I suggest a data-based approach where you conduct experiments on some x86/Arm CPUs and analyze some x86/Arm programs. I guess the analogy is: tell the dog to do something, see how it behaves, and draw conclusions from that?
hajile | 1 year ago
L1 cache is "free" in that you can fill it with simple data moves. uop cache requires actual work to decode and store elements for use in addition to moving the data. As to working ahead, you already covered this yourself. If you have a nearly 1-to-1 instruction-to-uop ratio, having just 4 decoders (eg, zen4) is a problem because you can execute a lot more than just 4 instructions on the backend. 6-wide Zen4 means you use 50% more instructions than you decode per clock. You make up for this in loops, but that means while you're executing your current loop, you must be maxing out the decoders to speculatively fill the rest of the uop cache before the loop finishes. If the loop finishes and you don't have the next bunch of instructions decoded, you have a multi-cycle delay coming down the pipeline.
2. I'd LOVE to see a similar study of current ARM chips, but I think the answer here is pretty simple to deduce. ARM's slide says "4x smaller decoders vs A710" despite adding a 5th decoder. They claim a 20% reduction in power at the same performance, and the biggest change is the decoder. As x86 decode is absolutely more complex than aarch32, we can only deduce that switching from x86 to aarch64 would be an even more massive reduction. If we assume an identical 75% reduction in decoder power, we'd move from 4.8W for Haswell's decoder down to 1.2W, reducing total core power from 22.1W to 18.5W, a ~16% overall reduction. That isn't too far from the power numbers claimed by ARM.
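Spelling out that arithmetic (a sketch using the assumed figures above, not measured data):

    /* Back-of-envelope check: assumed 4.8 W Haswell decode, 22.1 W
     * total core power, and a 75% reduction in decoder power. */
    #include <stdio.h>

    int main(void) {
        double core = 22.1, decode = 4.8, cut = 0.75;
        double saved = decode * cut;        /* 3.6 W saved */
        double new_core = core - saved;     /* 18.5 W */
        printf("%.1f W -> %.1f W, %.0f%% overall reduction\n",
               core, new_core, 100.0 * saved / core);   /* ~16% */
        return 0;
    }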
4. This was a tangent. I was talking about uops rather than the ISA. Intel claims to be simple RISC internally just like ARM, but if Intel is using nearly 30% fewer uops to do the same work, their "RISC" backend is way more complex than they're admitting.
8. I believe aligning functions to cacheline boundaries is a default flag at higher optimization levels, and I'm pretty sure they did the analysis before enabling it by default. x86's NOP flexibility is superior to ARM's (as is its ability to avoid NOPs entirely), but the cause is the weirdness of the x86 ISA, and I think that weirdness is an overall net negative.
Loads of x86 instructions are microcode-only. Use one and it'll be thousands of cycles. They remain in microcode because nobody uses them (so why even try to optimize them?), and they aren't used because they're dog slow. How would you collect data about this? Nothing will ever change unless someone pours millions of dollars of engineering hours into attempting to speed them up, and why would anyone want to do that?
Optimizing for a local maximum rather than the global maximum happens all over technology, and it happens exactly because of the data-driven approach you're talking about: look for the hot code and optimize it, without regard for the possibility that there's a better architecture you could be using instead. Many successes relied on an intuitive hunch.
ISA history has a ton of examples. iAPX432 super-CISC, the RISC movement, branch delay slots, register windows, EPIC/VLIW, Bulldozer's CMT, or even the Mill design. All of these were attempts to find new maxima with greater or lesser degrees of success. When you look into these, pretty much NONE of them had any real data to drive them because there wasn't any data until they'd actually started work.
clamchowder | 1 year ago
Arm's predecoded L1i cache is not "free" and can't be filled with simple data moves. You need predecode logic to translate raw instruction bytes into an intermediate format. If Arm expanded predecode to handle fusion cases in A715, that predecode logic is likely more complex than in prior generations.
2. Size/area is different from power consumption. Also, the decoder is far from the only change. The BTBs went from a 2-level to a 3-level arrangement, and that can help efficiency (you could make a smaller L2 BTB with similar latency, while a slower third level keeps capacity up). TLBs are bigger, probably reducing page walks. Remember that page walks are memory accesses, and the paper earlier showed data transfers account for a large percentage of dynamic power.
4. IMO no one is really RISC or CISC these days
8. Sure, you can align the function or not. I don't think it matters except in rare corner cases on very old cores. Not sure why you think it's an overall net negative; "feeling weird" does not make for solid analysis.
Most x86 instructions are not microcode-only. Again, check your data with performance counters; microcoded instructions are in the extreme minority. Maybe microcoded instructions were more common in 1978 with the 8086, but a few things have changed between then and now. Also, microcoded instructions do not cost thousands of cycles. Have you checked? E.g., a gather is ~22 micro-ops on Haswell per https://uops.info/table.html, and Golden Cove does it in 5-7 uops (a sketch of the instruction in question is below).
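For reference, "a gather" here is a single AVX2 instruction; a minimal intrinsics sketch (needs -mavx2):

    /* vgatherdps: loads 8 floats from table at 8 indices in one
     * instruction. ~22 uops on Haswell per uops.info, 5-7 on Golden
     * Cove, i.e. multi-uop, not thousands of cycles. */
    #include <immintrin.h>

    void gather8(float *dst, const float *table, const int *idx) {
        __m256i vidx = _mm256_loadu_si256((const __m256i *)idx); /* 8 indices */
        __m256  vals = _mm256_i32gather_ps(table, vidx, 4);      /* scale = 4 bytes */
        _mm256_storeu_ps(dst, vals);                             /* 8 gathered floats */
    }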
ISA history has a lot of failed examples where people tried to lean on the ISA to simplify the core architecture. EPIC/VLIW, branch delay slots, and register windows have all died off. Mill is a dumb idea and never went anywhere. Everyone has converged on big OoO machines for a reason, even though doing OoO execution is really complex.
If you're interested in cases where ISA does matter, look at GPUs. VLIW had some success there (AMD TeraScale, the HD 2xxx through 6xxx generations). Static instruction scheduling has been used in Nvidia GPUs since Kepler. In CPUs, ISA really doesn't matter unless you do something that actively makes an OoO implementation harder, like register windows or predication.