top | item 16753218

Tobba_ | 8 years ago

It'll change if someone can manage to take the "central" out of the CPU internals. You don't necessarily need software to see anything other than a monolithic core, but having to plumb everything through one central execution unit is hugely inefficient, if only due to the latency involved. For example, if you're performing an indirect load and hit DRAM while loading the pointer, that result has to be brought into the core, then all the way back to the memory controller the same way it came. So far that's just been worked around by throwing in bigger and bigger caches, but the size of first-level caches is at a dead end for now (they need physical proximity to the core).

Heck, current x86 chips could be juiced quite a bit if you could drop the requirement for backwards compatibility. Instruction encoding is the obvious thing (not because fixed-width RISC encodings are hip, but because x86's is an absolute mess that a huge proportion of the chip's power has to be wasted on, and it's pretty space-inefficient due to how horribly the opcode space is allocated). Less obviously, you could remove things like the data stack instructions (which, at least on Intel, have a dedicated "stack engine" to optimize them) and the ability to read/write instruction memory directly (which creates a mess of self-modifying-code detection to maintain correct behaviour, and complicates L1 cache coherency a bit). Trimming transistors reduces power consumption, which in turn means you can raise the voltage without the chip melting, and frees up space in your critical data path.

gpderetta|8 years ago

On a high-end x86, the decoder takes only a tiny proportion of the area and power budget.

On smaller low power cpus it is more significant of course.

The stack engine is necessary anyway, even if you have no dedicated stack instructions, as it removes the dependency of local variable accesses on top-of-stack manipulation, which is critical. Explicit stack manipulation instructions might actually make the stack engine simpler.

A coherent instruction cache and pipeline are super relevant in this age of pervasive self-modifying code (a.k.a. JITs).

Modern CPUs are complex for a reason.

ryanpetrich|8 years ago

It's relatively straightforward for self-modifying code to manually flush instruction caches when necessary, and JIT compilers that target other architectures already satisfy this requirement. Only backwards compatibility with existing x86 software requires a coherent instruction cache.

Tuna-Fish|8 years ago

Processing in memory has real promise for cases where your work can be distributed. Specifically, I think it can have a great future in AI. However, for general-purpose code I doubt it can do anything. Your example of an indirect load would be greatly sped up if the target of the pointer is on the same device as the pointer. However, the second it isn't, moving things from one RAM chip to another isn't any faster than moving them from a RAM chip to the CPU, and at that point defining a single central location that tries to be close to everything just makes sense. If your operation needs 8 values from 8 different places, having a central location means doing 8 transfers, while PIM can mean forwarding each value (or intermediate values) multiple times to get to the next location.

None of the changes to x86 people have thought of over the years really help enough to justify breaking backcompat, simply because they aren't on the fast path in the critical execution stage. The limit that power imposes on frequency in current CPUs is not really the total amount of power consumed, it's the amount of power consumed in the <0.25mm² of chip that houses the register file, forwarding network and ALUs -- that is, the place where things actually happen during the most important pipeline stage. This is why an 8-core CPU running just a single thread cannot let that one core consume as much power as all 8 would if running 8 threads -- the register file of the running core would just melt, even though total power would stay below the chip's limits.

x86 decoding is hairy and takes a long time and a lot of transistors. However, it sits in its own pipeline stages, which run in parallel with execute and only slow it down by making a branch miss a little more expensive. And its power draw is limited today by caching the decoded uops in their own cache, so during any tight loop the decode hardware is idle and consumes no power. The same sort of goes for the stack engine -- since it runs early in the pipeline, it is basically a way to compress instructions a little: it saves power by making code more compact when it is used, and does nothing when it is not. Removing it would not really help, even if all code instantly changed to accommodate.

Much of the rest of the x86 architecture's ugly warts are handled in the time-honored CISC way: just punt it to microcode, performance be damned. Today, self-modifying code technically works, but you never want to do it, because invalidating lines in the L1i has been implemented in whatever way is fastest and cheapest for the common case of code that does not modify itself. (And that machinery has to exist even if you don't support self-modifying code, because there has to be some way of invalidating L1i entries.) Similarly, a lot of the CISC instructions that make more sense to implement as software routines (FPU sin/cos, for example) are today just abandoned ucode routines that are slower than rolling your own.

Tobba_|8 years ago

I'm not talking about the fundamentally misguided memory-distributed computing stuff; I mean "improve flexibility enough that you can bolt some additional units on as offload" (though address translation would take some work in that case). The magic of presenting software with a more or less monolithic core here is that you don't have that problem, since you can simply do things the usual way.

Also, I don't think the trouble with added complexity outside the hot path is any added latency; it's that it needlessly burns up the thermal budget. Not that raising the voltage is the best way of increasing frequency, but it's sure to do so.