Most people focus on Intel's P6-derived line of uarches (Golden Cove, Redwood Cove), because those are the cores with the highest performance.
But I think the Atom-derived "mont" line (Gracemont, Crestmont) is much more interesting, because it's where Intel is innovating and experimenting with new approaches.
I suspect Intel is planning to drop their P-Core line entirely in the near future. If you look at the IPC numbers, Gracemont is actually roughly equal to Golden Cove on integer workloads, and it's quite a bit smaller. If Intel widened the FPU to 256 bits, their "mont" cores would probably get roughly equal IPC on FPU workloads too.
Importantly, the "mont" uarch has one major advantage over the "cove" uarch, and that's the clustered instruction decoding approach. Golden Cove finally managed to move to a 6-wide instruction decoder after being stuck with 4-wide instruction decoders for decades. And it can only sustain decoding 6 instructions per cycle if there is no more than one complex instruction every 6 instructions. The uop cache goes a long way toward compensating for this, but that takes up a lot of silicon.
And now Crestmont has perfected the approach of combining the instruction streams from two independent 3-wide instruction decoders. It can match the 6 instructions per cycle peaks of the coves, but with much simpler decoders. And because they are independent, it can handle one complex instruction every 3 instructions. It doesn't even need a uop cache.
The best part is that it's scalable. There is absolutely nothing stopping Intel from adding a third instruction decode cluster to reach 9 instructions per cycle. Or a fourth cluster. Or a fifth...
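To put rough numbers on the decode argument, here is a toy throughput model in Python. It's a sketch under loud assumptions, not a description of Intel's actual hardware: `decode_cycles` and `clustered_cycles` are made-up names, each decoder group is assumed to accept at most one complex instruction per cycle, and the instruction stream is assumed to be cut into fixed-size blocks that are dealt round-robin to the clusters (real cores choose block boundaries differently, which this ignores).

```python
from typing import List


def decode_cycles(is_complex: List[bool], width: int, complex_per_cycle: int = 1) -> int:
    """Cycles for ONE in-order decoder group to consume the stream.

    Assumption: the group decodes up to `width` instructions per cycle,
    but at most `complex_per_cycle` (>= 1) of them may be complex.
    """
    cycles = 0
    i = 0
    n = len(is_complex)
    while i < n:
        taken = 0
        complex_used = 0
        while i < n and taken < width:
            if is_complex[i]:
                if complex_used == complex_per_cycle:
                    break  # this complex instruction waits for the next cycle
                complex_used += 1
            taken += 1
            i += 1
        cycles += 1
    return cycles


def clustered_cycles(is_complex: List[bool], clusters: int, width: int, block: int) -> int:
    """Cycles for several independent decoder groups working in parallel.

    Assumption: the stream is split into `block`-sized chunks, handed out
    round-robin; the slowest cluster determines the total.
    """
    per_cluster = [0] * clusters
    for b, start in enumerate(range(0, len(is_complex), block)):
        chunk = is_complex[start:start + block]
        per_cluster[b % clusters] += decode_cycles(chunk, width)
    return max(per_cluster)


# 36 instructions, every third one complex (a hypothetical worst-ish case).
stream = [True, False, False] * 12

print(decode_cycles(stream, width=6))                          # one 6-wide decoder
print(clustered_cycles(stream, clusters=2, width=3, block=6))  # two 3-wide clusters
print(clustered_cycles(stream, clusters=3, width=3, block=6))  # three 3-wide clusters
```

In this model the single 6-wide decoder stalls on every second complex instruction and drops to 3 instructions per cycle on that stream, while the two 3-wide clusters still sustain 6 and a third cluster gets to 9. On a stream with no complex instructions, the 6-wide decoder and the 2x3 setup tie.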
There is a planned line of Xeons with a lot of *mont cores, targeting workloads like web hosting. It makes sense to not use the huge cores we now get from Intel and AMD for most workloads.
It wouldn't be the first time Intel dropped a high-performance consumer architecture and started upcycling their lower-power designs instead. Netburst ran hot and slow, and the Core microarchitecture which replaced it was derived from their mobile designs instead.
Cove cores are huge and eat a lot of power and die area, which has hindered Intel significantly ever since it fell behind in the foundry race. AMD's Zen architectures have been efficient with silicon, especially with the new c variants.
> Gracemont is actually roughly equal to Golden Cove on integer workloads, and it's quite a bit smaller. If Intel widened the FPU to 256 bits, their "mont" cores would probably get roughly equal IPC on FPU workloads too.
How much smaller would their 'core' cores be if they optimized them for a low max clock? Zen4c is roughly half the size and it's nearly the same as Zen4, just with a low max clock (and a tweaked cache).
I wonder if there are hints about "rentable units" in here. There are rumors that a future module can act as a single really wide core or two moderate cores.
phire | 1 year ago
pclmulqdq | 1 year ago
mmaniac | 1 year ago
toast0 | 1 year ago
wmf | 1 year ago