etep|8 years ago
Because a 4 KB page spans 64 cache lines, you can have at most 64 cache sets. With an 8-way associative cache this works out to 32 KB. Using 128 sets would cause aliasing, but with 64 sets the cache index is built entirely from the LSBs that just index into the page (i.e. the bits not used in the TLB lookup). Thus, the only ways to grow L1 capacity are to: - totally abandon 4KB pages in favor of (e.g.) 2MB pages (not likely) - increase cache associativity (likely imo) - stop using virtual index+physical tag (not likely imo)
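The arithmetic behind that constraint can be sketched in a few lines (my own illustration, not from the comment; the constants are the usual 4 KB page / 64 B line / 8-way configuration):

```python
# Sketch: why a VIPT (virtually indexed, physically tagged) L1 is capped
# at page_size * associativity. The set index must come entirely from the
# page-offset bits, so the lookup can start before the TLB translates.
PAGE_SIZE = 4096   # bytes
LINE_SIZE = 64     # bytes
WAYS = 8           # associativity

sets = PAGE_SIZE // LINE_SIZE           # max sets without aliasing
capacity = sets * WAYS * LINE_SIZE      # total cache size in bytes

print(sets)              # 64
print(capacity // 1024)  # 32 (KB)
```

With 128 sets the index would need one bit above the page offset, and two virtual aliases of the same physical line could land in different sets.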
CalChris|8 years ago
Design Principle 2: Smaller is faster. [1]
BTW, if you look at Agner Fog's latency tables [2], mov mem,r (load) went from 3 cycles in Haswell to 2 cycles in Skylake. So Intel has been concentrating on faster which is nice.
And by way of comparison, AMD increased their μop cache size in Ryzen, but only slightly: way size went from 6 μops to 8. This matches their increase in EUs.
[1] Patterson and Hennessy. Computer Organization and Design, 5th edition, p. 67.
[2] http://www.agner.org/optimize/instruction_tables.pdf
etep|8 years ago
Always impressed that Agner Fog takes the time to publish his results. Pretty amazing. But I think focusing your thinking on the register count in MIPS or on the uarch for some random opcode does not get at the real constraints on L1 cache design at all. One could say that x86 should be even faster, because hey, it has far fewer than 32 registers (historically at least).
My response is this: yes, the L1 has to be small to be fast, but it has been stuck at 32 KB forever now. It could have grown! So it's not as simple as smaller is faster.
fsaintjacques|8 years ago
On an unrelated subject, do you know if the next desktop generation (coffeelake?) will support AVX512 or should I just buy a skylake-x?
redcalx|8 years ago
https://www.kitguru.net/components/cpu/anton-shilov/intel-sk...
Tuna-Fish|8 years ago
Increasing cache size grows latency in two ways: every doubling of the cache size adds one mux to the select path, and every doubling increases the wire delay from the most distant element by ~sqrt(2). Both of these are additively on the critical path, and spending more time on them would require increasing the cache latency.
The size of a cache is always a tradeoff against the latency of the cache. If this was not true, there would only be a single cache level that was both large and fast. However, making something both large and fast is impossible, so instead we have stacked cache levels starting with a very fast but small cache followed by increasingly slower and larger ones.
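Those two effects can be put into a rough back-of-the-envelope model (my own sketch, with made-up unit delays, not figures from the comment):

```python
import math

# Rough model of the extra latency from growing a cache beyond a
# baseline size: each doubling adds one mux level to the select path,
# and multiplies the worst-case wire distance by ~sqrt(2).
def extra_delay(size_kb, base_kb=32, mux_delay=1.0, wire_delay=1.0):
    doublings = math.log2(size_kb / base_kb)
    mux = doublings * mux_delay                           # added mux levels
    wire = wire_delay * (math.sqrt(2) ** doublings - 1)   # added wire delay
    return mux + wire

# Doubling a 32 KB L1 to 64 KB adds one mux plus ~41% more wire delay:
print(extra_delay(64))   # ~1.41 (in arbitrary delay units)
print(extra_delay(32))   # 0.0  (baseline)
```

Both terms grow monotonically and sit on the critical path, which is why the large-and-fast single-level cache does not exist and we get a hierarchy instead.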