I thought this was impressive, since it's basically single-stepping the code and disassembling each instruction on the fly to figure out the memory accesses it generates, in order to update a simulated cache and compute any misses.
It then integrates with the IDE to give you mouse-over reporting of a single statement's cache hit performance.
Also, the submission has been edited: I mentioned that the system is open source, and the link (http://github.com/insomniacgames/ig-cachesim) was on the page I submitted, but it only appears on page 105/105 of the slide PDF.
Edit, disclosure: I once had the pleasure of working with deplinenoise, and tried to sponge up as much knowledge as possible. He's great.
It is amazing how quickly he was able to deliver this tool: 2 weeks to build a very specialized tool AND perform the optimizations on the code. What am I doing with my life?
On one hand: Yes, this is an extremely cool hack. Hats off.
On the other hand: It probably seems particularly impressive because its building blocks don't happen to be in your skill set. A rule of thumb: Something is impressive in proportion to how many times you think "How did they do that?!".
In this case, if you're not already familiar with disassembly, binary instrumentation, how caches work, etc., then every step along the way sounds magical. But to a low-level/systems programmer, this sort of thing is their bread and butter. So their reaction wouldn't be "OMG how did he even...?" - it's probably more like "Niiiiiiice".
(If you are, e.g., a web developer, you probably have an equivalently ridiculous level of background knowledge about the internals of 5+ levels of the web stack, and how each goes wrong. I have seen systems guys - people who would jump on this project in a heartbeat - take one look at modern web dev and flee with their heads in their hands, asking "how do they know all that stuff?!")
Unless I'm missing something, he got approval for 2 weeks to dig in ... it doesn't say how long the entire project took (typically something like this would be a part-time task while working on immediate team/game needs).
Fantastic work though, and wonderful that Insomniac have open sourced it.
It would be so awesome if we could simply springboard off the "wow, this is amazing..." factor of highly novel ideas we've never heard of before and leverage our awe to soak in new things at superspeed. Instead, we get overwhelmed and demotivated and think "I could never learn that" - which only serves to send us spiralling in the opposite direction!
Completely serious question/request: I'd be hugely interested in psychology hacks/ideas that short-circuit the brain (safely) to remedy this "thinking anti-pattern".
I think self-worth and confidence might be buried in the answer. I don't know if this is my own bias (depression?), but I can't help but think that there's the tiniest edge of "I'm not worth learning this", which snowballs in a big way. Lack of confidence would be an obvious blocker, though.
Insomniac has been working on this kind of stuff for the last decade... the stuff they do is impressive, but don't think it just happened out of the blue...
I'm taking a computer architecture grad course right now, and coincidentally our first project involved writing a cache simulator. Obviously, the result was much more coarse-grained than this work.
For our project, we only needed to simulate cache accesses and not the content. We also kept track of the LRU block, simulated sub-blocking, and added a FIFO victim cache to the simulation. It was a fun exercise, and I got to learn a bit more C++ in the process.
The next project involves building a CPU pipeline simulator, which might be more challenging.
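A minimal sketch of that kind of simulator (my own toy, not the course project's code) fits in a few dozen lines - a set-associative cache where each set keeps its tags in LRU order:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy set-associative cache model: each access maps its line to a set,
// searches the tags, and maintains LRU order by moving the hit way to
// the front of the set's tag list.
class CacheSim {
public:
    CacheSim(std::size_t sets, std::size_t ways, std::size_t lineBytes)
        : ways_(ways), lineBytes_(lineBytes), tags_(sets) {}

    // Returns true on hit, false on miss (the line is filled on miss).
    bool Access(uint64_t addr) {
        uint64_t line = addr / lineBytes_;
        auto& set = tags_[line % tags_.size()];
        uint64_t tag = line / tags_.size();
        for (std::size_t i = 0; i < set.size(); ++i) {
            if (set[i] == tag) {                    // hit: refresh LRU order
                set.erase(set.begin() + i);
                set.insert(set.begin(), tag);
                ++hits_;
                return true;
            }
        }
        set.insert(set.begin(), tag);               // miss: fill as MRU
        if (set.size() > ways_) set.pop_back();     // evict the LRU way
        ++misses_;
        return false;
    }

    std::size_t hits_ = 0, misses_ = 0;
private:
    std::size_t ways_, lineBytes_;
    std::vector<std::vector<uint64_t>> tags_;       // per-set tags, MRU first
};
```

For example, a 32 KB, 8-way L1 with 64-byte lines would be `CacheSim(64, 8, 64)`, and a FIFO victim cache could be bolted on as a small secondary lookup on the miss path.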
Having glanced at the code, it seems fairly easy to extend to modern Intel CPUs, which is great. The biggest caveat is the lack of prefetcher simulation (on real hardware the prefetcher turns many seemingly horrible cache misses into a non-issue), but that's a non-trivial extension.
Regardless, very nice to see tools like this being open sourced. The games industry has a lot of valuable experience with performance-critical code and I'd love to see more of it shared publicly.
If you use this tool to find the outliers which just break the cache badly, then you can apply some human introspection on the data and figure out what is a problem and what is not.
Trying to add a prefetch model would be painful and time-consuming, and may actually hide some cache misses that you would want to find. Also, the prefetch algorithm can change between CPU families, and potentially even within a family.
> The biggest caveat is the lack of prefetcher simulation
Certainly an important point, but only relevant if what you are doing had a hope of being prefetched in the first place. Maybe the tool could be extended to give an idea of how 'prefetchable' something is? That should be simpler than a full prefetcher simulation. Then you can concentrate on the predicted cache misses that are unlikely to be prefetched. The tool is more focused on pointing out areas you can improve; it doesn't matter if it gets it wrong sometimes.
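One cheap way to approximate that (an entirely hypothetical heuristic, not something ig-cachesim does): hardware stream prefetchers mostly cover constant-stride accesses, so track the stride of the addresses each instruction generates and treat long steady-stride runs as "likely prefetched":

```cpp
#include <cstdint>
#include <unordered_map>

// Hypothetical "prefetchability" heuristic: per instruction (keyed by PC),
// remember the last address and stride; a run of identical non-zero strides
// is the kind of stream a hardware prefetcher would likely have covered.
struct StrideTracker {
    struct State { uint64_t lastAddr = 0; int64_t lastStride = 0; int streak = 0; };
    std::unordered_map<uint64_t, State> perPc;

    // Returns true if this access looks like part of a constant-stride stream.
    bool LikelyPrefetched(uint64_t pc, uint64_t addr) {
        State& s = perPc[pc];
        int64_t stride = (int64_t)(addr - s.lastAddr);
        bool steady = (s.streak > 0 && stride == s.lastStride && stride != 0);
        s.streak = steady ? s.streak + 1 : (s.lastAddr ? 1 : 0);
        s.lastStride = stride;
        s.lastAddr = addr;
        return steady && s.streak >= 3;  // demand a few repeats before trusting it
    }
};
```

Misses flagged this way could be de-prioritized in the report, while anything with an unstable stride (pointer chasing, hash lookups) stays front and centre.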
Is it bad to try to avoid relying on the prefetcher? It's obviously there for a good reason and presumably could cut down on a bunch of micro-optimisation; I'm just curious how big a blind spot the tool leaves by not considering it.
Coincidentally, I just watched this last night: it's a really accessible refresher if you're a bit outdated on how all these caches work and interact: https://youtu.be/OFgxAFdxYAQ
Great talk, I've been hoping to find something like this because CPUs seem like a dark art these days. Also, not related, but Cliff Click's accent is delightful.
Fun fact: a certain game-dev kit had another machine sitting alongside on the bus. It let you sample the Program Counter at N=1 for a small number of samples (I want to say ~4k) off of a breakpoint.
The amount of perf stuff you could trivially track down was staggering. Load-Store-Fetch? Boom, there it is. Cache misses? Clear as day.
Never worked on another platform with such an incredible profiler and still miss it to this day. You could set N to any power of 2 for coarser profiles but it always shined at N=1.
A small note: callgrind doesn't have to be all or nothing.
You can use the CALLGRIND_START_INSTRUMENTATION and CALLGRIND_STOP_INSTRUMENTATION macros to start and stop the stats collection. Start valgrind with the --instr-atstart=no option to disable collection everywhere outside those blocks.
It's still really, really slow, but much faster than when it collects everything.
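The start/stop pattern looks roughly like this (a minimal sketch; `hot_path` is a stand-in for whatever you want measured, and the fallback defines just let the example build where Valgrind's headers aren't installed):

```cpp
// Build normally, then run:  valgrind --tool=callgrind --instr-atstart=no ./app
#if defined(__has_include) && __has_include(<valgrind/callgrind.h>)
#  include <valgrind/callgrind.h>
#else
// No-op fallbacks so the sketch also compiles without Valgrind's headers.
#  define CALLGRIND_START_INSTRUMENTATION
#  define CALLGRIND_STOP_INSTRUMENTATION
#endif

// Stand-in for the code you actually want profiled.
long hot_path() {
    long sum = 0;
    for (long i = 0; i < 1000000; ++i) sum += i & 7;
    return sum;
}

long profile_hot_path() {
    // Everything before this line runs uninstrumented (fast).
    CALLGRIND_START_INSTRUMENTATION;
    long result = hot_path();
    CALLGRIND_STOP_INSTRUMENTATION;
    // Everything after it runs uninstrumented again.
    return result;
}
```

Outside valgrind the macros are harmless no-ops, so the instrumented build can double as the normal one.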
Isn't this pretty dependent on the specific cache architecture? Definitely nowhere near an expert here, but I thought CPUs all had varying sizes/levels of caches, on top of additional structures like victim caches. How much portability does a one-size-fits-all tool like this have, or is the variation between CPUs not an issue?
Yeah, you need to tell it what cache configuration you want to simulate (there's an example of doing this for the Jaguar cache on slides 84/85). Valgrind's cachegrind tool also lets you tweak the cache config it simulates.
Some bugs are going to cause problems regardless of the cache config (like the "access a cold page unnecessarily every iteration of the loop" examples); some might be more sensitive to exact cache configuration.
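To make that first kind of bug concrete, here's a toy illustration (my own example, not from the slides): because the stores through `out` might legally alias `s->scale`, the compiler can't hoist the read, so the loop re-touches the cold `Settings` line every iteration.

```cpp
#include <cstddef>

struct Settings {
    char padding[4096];  // unrelated data pushes 'scale' onto its own cold page
    int scale;
};

// Anti-pattern: re-reads s->scale on every iteration. The compiler cannot
// hoist it because the int stores through 'out' might alias it, so a
// streaming workload that evicts that line pays a miss each time round.
void ScaleAll(int* out, const int* in, std::size_t n, const Settings* s) {
    for (std::size_t i = 0; i < n; ++i)
        out[i] = in[i] * s->scale;
}

// Fix: hoist the cold read into a local so the line is touched exactly once,
// regardless of the cache configuration being simulated.
void ScaleAllHoisted(int* out, const int* in, std::size_t n, const Settings* s) {
    const int scale = s->scale;
    for (std::size_t i = 0; i < n; ++i)
        out[i] = in[i] * scale;
}
```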
I wish the slides went into more detail about what is meant by a "Jaguar cache." It immediately made me think of [1], but clearly no one is writing games for it in 2017.
BTW, the "n" key jumps to the next slide without scrolling. (In Firefox's built-in PDF reader, anyways - I found that by accident when trying to make the space bar behave the way I wanted ;)
I don't think this can be called "sampling"; it's single-stepping and analysing each instruction.
There's nothing statistical about it, no "samples" that are hoped to represent some hidden full system, since the entire instruction stream is analyzed when the simulator is enabled.
angersock | 9 years ago
I'm not saying that programmers are getting worse--actually wait, no, that's what I'm saying (myself included).
gravypod | 9 years ago
You work hard when your job is your hobby, passion, and profession.
[1] https://en.wikipedia.org/wiki/Atari_Jaguar
keldaris | 9 years ago
It's AMD's Jaguar microarchitecture, the CPU in the PS4 and Xbox One: https://en.wikipedia.org/wiki/Jaguar_%28microarchitecture%29
drudru11 | 9 years ago
I am sure the next steps will be better visualization and simulator accuracy.