Same as the failure of Itanium VLIW instructions: you don't actually want to force the decision of what is in the cache back to compile time, when the relevant information is better available at runtime.
Also, additional information on instructions costs instruction bandwidth and I-cache.
> you don't actually want to force the decision of what is in the cache back to compile time, when the relevant information is better available at runtime
That is very context-dependent. In high-performance code having explicit control over caches can be very beneficial. CUDA and similar give you that ability and it is used extensively.
Now, for general "I wrote some code and want the hardware to run it fast with little effort from my side", I agree that transparent caches are the way.
That seems correct, but it also doesn’t account for managed languages with runtimes like JavaScript or Java or .NET, which probably have a lot of interesting runtime info they could use to influence caching behavior. There’s an amount of “who caches the cacher” if you go down this path (who manages cache lines for the V8 native code that is in turn managing cache lines for jitted JavaScript code), but it still seems like there is opportunity there?
thats a strange statement. its certainly not black and white, but the compiler has explicit lifetime information, while the cache infrastructure is using heuristics. I worked on a project which supported region tags in the cache for compiler-directed allocation and it showed some decent gains (in simulation).
I guess this is one place where it seems possible to allow for compiler annotations without disabling the default heuristics so you could maybe get the best of both.
There are cache control instructions already. The reason why it goes no further than prefetch/invalidate hints is probably because exposing a fuller api on the chip level to control the cache would overcomplicate designs, not be backwards compatible/stable api. Treating the cache as ram would also require a controller, which then also needs to receive instructions, or the cpu has to suddenly manage the cache itself.
I can understand why they just decide to bake the cache algorithms into hardware, validate it and be done with it. Id love if a hardware engineer or more well-read fellow could chime in.
Because programmers are in general worse at managing them than the basic LRU algorithm.
And because the abstraction is simple and easy enough to understand that when you do need close control, it's easy to achieve by just writing to the abstraction. Careful control of data layout and nontemporal instructions are almost always all you need.
There has! Intel has Cache Acceleration Technology, and I was very peripherally involved in reviewing research projects at Boston University into this. One that I remember was allowing the operating system to divide up cache and memory bandwidth for better prioritization.
This is not applicable to most programming scenarios since the cache gets trashed unpredictably during context switches (including the user-level task switches involved in cooperative async patterns). It's not a true scratchpad storage, and turning it into one would slow down context switches a lot since the scratchpad would be processor state. Maybe this can be revisited once even low-end computers have so many hardware cores/threads that context switches become so rare that the overhead is not a big deal. But we are very far from anything of the sort.
I would say this is the main benefit of cuda programming on gpu. You get to control local memory. Maybe nvidia will bring it to the cpu now that the make CPU’s
pjc50|3 days ago
Also, additional information on instructions costs instruction bandwidth and I-cache.
david-gpu|3 days ago
That is very context-dependent. In high-performance code having explicit control over caches can be very beneficial. CUDA and similar give you that ability and it is used extensively.
Now, for general "I wrote some code and want the hardware to run it fast with little effort from my side", I agree that transparent caches are the way.
oldmanhorton|3 days ago
convolvatron|3 days ago
I guess this is one place where it seems possible to allow for compiler annotations without disabling the default heuristics so you could maybe get the best of both.
greybcg|4 days ago
I can understand why they just decide to bake the cache algorithms into hardware, validate it and be done with it. Id love if a hardware engineer or more well-read fellow could chime in.
Someone|3 days ago
Tuna-Fish|3 days ago
And because the abstraction is simple and easy enough to understand that when you do need close control, it's easy to achieve by just writing to the abstraction. Careful control of data layout and nontemporal instructions are almost always all you need.
rwmj|4 days ago
https://www.intel.com/content/www/us/en/developer/articles/t...
zozbot234|3 days ago
naveen99|2 days ago