top | item 14506596


etep|8 years ago

I had wondered why L1 caches are not growing while L2 and L3 capacities continue to grow. A significant limitation on L1 cache size is actually the fixed relationship between associativity, cache line size, and page size (i.e. the 4 KB pages allocated by the OS to processes).

Because a 4 KB page holds 64 cache lines (of 64 bytes each), you can have at most 64 cache sets. With an 8-way associative cache this works out to 32 KB. Using 128 sets would cause aliasing, but with 64 sets the cache index is built entirely from the LSBs that just index into the page (i.e. bits not used in the TLB lookup). Thus, the only ways to increase L1 capacity are to:

- totally abandon 4 KB pages in favor of (e.g.) 2 MB pages (not likely)

- increase cache associativity (likely imo)

- stop using virtual index + physical tag (not likely imo)
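The arithmetic above can be sketched as follows (a rough sketch; the 64 B line size and bit positions are the usual x86 values, not taken from the comment):

```python
PAGE_SIZE = 4096      # 4 KB OS page
LINE_SIZE = 64        # cache line size in bytes
WAYS      = 8         # associativity

# With a virtually-indexed, physically-tagged (VIPT) L1, the set index
# must come entirely from the page-offset bits, which need no translation.
lines_per_page = PAGE_SIZE // LINE_SIZE   # 64 lines per page
max_sets       = lines_per_page           # so at most 64 sets without aliasing

capacity = max_sets * WAYS * LINE_SIZE    # 64 sets * 8 ways * 64 B
print(capacity // 1024, "KB")             # -> 32 KB

def set_index(addr):
    """Set index uses only bits [6:12), i.e. bits inside the page offset,
    so virtual and physical addresses yield the same index."""
    return (addr // LINE_SIZE) % max_sets
```

With 128 sets the index would need bit 12, which differs between virtual and physical addresses, hence the aliasing problem the comment mentions.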


CalChris|8 years ago

I think a simpler argument is that for L1 you want fast, not big. Same thing with registers (a form of cache at a lower level). Why did MIPS only have 32 registers?

Design Principle 2: Smaller is faster. [1]

BTW, if you look at Agner Fog's latency tables [2], mov r,mem (load) went from 3 cycles in Haswell to 2 cycles in Skylake. So Intel has been concentrating on faster, which is nice.

And by way of comparison, AMD increased their μop cache size in Ryzen, but only slightly: way size went from 6 μops to 8. This matches their increase in EUs.

[1] Patterson and Hennessy. Computer Organization and Design, 5th edition, p. 67.

[2] http://www.agner.org/optimize/instruction_tables.pdf

etep|8 years ago

At a high level it's true that smaller is faster, but it's also true that those L1s could have grown by adding sets (not ways) and achieved the same latency. L2 has grown, but stayed iso-latency. This seems to say that "smaller is faster" does not always hold.

Always impressed that Agner Fog takes the time to publish his results. Pretty amazing. But I think focusing your thinking on the register count in MIPS or the uarch for some random opcode does not get into the real constraints on L1 cache design at all. One could say that x86 should be even faster, because hey, far fewer than 32 registers (historically at least).

My point is this: yes, the L1 has to be small to be fast, but it has been stuck at 32 KB forever now. It could have grown! So it's not as simple as small is fast.

marcosdumay|8 years ago

Increasing the register count spends opcode encoding space. That leads to fewer available instructions, or at a minimum constrains opcode optimization.
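The encoding-space cost is easy to quantify for a MIPS-style fixed 32-bit, 3-operand instruction (an illustrative calculation, not from the comment):

```python
import math

def specifier_bits(num_regs, operands=3):
    """Bits consumed by register specifiers in a fixed-width instruction.
    Every extra bit here comes out of the opcode/immediate budget."""
    return operands * math.ceil(math.log2(num_regs))

# Doubling the register file from 32 to 64 costs 3 more bits
# out of a fixed 32-bit encoding.
for regs in (16, 32, 64):
    print(regs, "regs ->", specifier_bits(regs), "specifier bits")
```

With 32 registers, 3 specifiers take 15 of the 32 bits; at 64 registers that grows to 18, squeezing the remaining opcode and immediate fields.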

keldaris|8 years ago

Since I don't see any flaw in your reasoning here, the obvious question becomes - why haven't they just moved to 16-way associative L1 yet? What are the hurdles?

Tuna-Fish|8 years ago

The reason he gave is not the main reason L1s don't grow. The main reason is latency.

Increasing cache size grows latency in two ways: every doubling of the cache size adds one mux to the select path, and every doubling increases the wire delay from the most distant element by ~sqrt(2). Both of these add to the critical path, so absorbing them would require increasing the cache latency.
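As a rough illustration of that scaling (the delay constants are hypothetical units, not real silicon numbers):

```python
import math

MUX_DELAY  = 1.0   # hypothetical delay units per added mux level
WIRE_DELAY = 1.0   # hypothetical wire delay at the 32 KB baseline

def extra_latency(doublings):
    """Added critical-path delay after n doublings of cache capacity."""
    mux  = doublings * MUX_DELAY                          # one mux per doubling
    wire = WIRE_DELAY * (math.sqrt(2) ** doublings - 1.0)  # distance ~ sqrt(area)
    return mux + wire

for n in range(4):
    size_kb = 32 * 2 ** n
    print(f"{size_kb:4d} KB: +{extra_latency(n):.2f} delay units")
```

The mux term grows linearly and the wire term geometrically, so each doubling costs more than the last.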

The size of a cache is always a tradeoff against the latency of the cache. If this was not true, there would only be a single cache level that was both large and fast. However, making something both large and fast is impossible, so instead we have stacked cache levels starting with a very fast but small cache followed by increasingly slower and larger ones.

etep|8 years ago

If it were easy to build high-performance caches with high associativity, we would certainly see higher associativity. Ideally, you want a fully associative cache, but it's too expensive. In CPU caches, once a set is selected, all N associative ways are compared simultaneously. So growing associativity costs area and power for the extra comparators. This growth could cause timing issues, i.e. add latency to the memory access or force the CPU frequency to be lowered.
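A minimal sketch of that parallel tag compare (a hypothetical structure for illustration, not any specific CPU's design):

```python
LINE_SIZE = 64
NUM_SETS  = 64
WAYS      = 8

# tags[set][way] holds the stored physical tag, or None if the way is invalid.
tags = [[None] * WAYS for _ in range(NUM_SETS)]

def lookup(phys_addr):
    """Select one set, then compare the tag against all WAYS entries at once.
    In hardware these are WAYS parallel comparators, so doubling the
    associativity doubles the comparators burning area and power."""
    index = (phys_addr // LINE_SIZE) % NUM_SETS
    tag   = phys_addr // (LINE_SIZE * NUM_SETS)
    hits  = [tags[index][w] == tag for w in range(WAYS)]  # N parallel compares
    return any(hits)

# Fill one line, then probe a hit and a same-set miss.
addr = 0x12345
tags[(addr // LINE_SIZE) % NUM_SETS][0] = addr // (LINE_SIZE * NUM_SETS)
print(lookup(addr))                          # -> True (tag matches in way 0)
print(lookup(addr + LINE_SIZE * NUM_SETS))   # same set, different tag -> False
```

Adding sets instead of ways avoids the extra comparators, which is the alternative growth path mentioned upthread.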

gchadwick|8 years ago

You can also build a memory system that is able to deal with the aliasing. Then you don't have the dependence on page size.