For those interested in RAM timings, cache effects, and other low-level memory topics explained exceptionally well, check out "What Every Programmer Should Know About Memory" [1].
That book is detailed, but that's also the problem with it. The relevant signal is all but drowned in the noise of 114 pages of details of CAS/RAS, circuit diagrams for SRAM and DRAM cells, bus protocols for hardware transactional memory, descriptions of Linux APIs for configuring NUMA support, etc.; frankly, things that need not be known unless dealing with those issues specifically.
The article in this post, on the other hand, displays a handful of key characteristics of the cache hierarchy graphically, using minimal code snippets and attractive graphs. I believe programmers would generally be better served by the intuitions gained from the blog post than daunted by the exhaustive detail and verbose presentation of Ulrich's book.
IOW, keep in mind simple things: measure; be aware of the cheapness of adjacent accesses, of the possibility of false sharing with certain strides, of the cost of following pointers, of the pitfalls of concurrent access to adjacent memory locations, and even of serial access to the same memory location (dependencies wrt ILP). Concentrate on building a mental model that provides intuitions about cache behaviour at the appropriate level of abstraction, but always be prepared to measure to verify those intuitions: hardware changes all the time, and different architectures can give rise to surprisingly different results.
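The last point, about serial access to the same memory location, can be sketched like this (the array size and the particular two-way split are illustrative). Both functions compute the same sum, but in the first one every addition depends on the previous one, while the second gives a superscalar core two independent dependency chains to overlap:

```c
#include <assert.h>
#include <stddef.h>

/* Every addition depends on the previous one through `sum`, so the
 * CPU cannot overlap the adds even when the loads all hit cache. */
long sum_one_chain(const int *a, size_t n) {
    long sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += a[i];
    return sum;
}

/* Two independent dependency chains: the core can execute both adds
 * in parallel, often nearly doubling throughput on cached data. */
long sum_two_chains(const int *a, size_t n) {
    long s0 = 0, s1 = 0;
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        s0 += a[i];
        s1 += a[i + 1];
    }
    if (i < n)
        s0 += a[i];
    return s0 + s1;
}
```

The results are identical; only the timing differs, and by how much depends on the particular core, which is exactly why measuring matters.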
Has anyone run similar tests on the JVM after warmup?
Theoretically the JVM should be pushing memory around to negate as much of the cache penalty as possible, and I've definitely noticed this in some algorithms (this is how people come up with code where Java outperforms almost identical code in pure C, and I've seen 3:1 speed differences in favor of the JVM), but I'd be curious to see the graphs.
Maybe I'll do it later today if I get a few minutes...
That seems very unlikely. Effects like the ones in the article (which is great, btw) can't be "fixed" simply by moving the data. In the first example, there's simply no way for any runtime environment to "know" at creation time that you would be walking the array in 16 byte strides. And if it tried to reorder the data dynamically, it would obviously have to touch all that memory, thus invalidating whatever caching could be done anyway.
Where JVMs beat naive C, it's almost always in the locking and synchronization code. See the Sun paper on Biased Locking for an example of the kind of tricks you can do by (for example) "guessing" that newly locked objects are only ever going to be referenced by one thread.
But memory is memory. The JVM's view of it lacks a raw pointer, but is otherwise pretty much the same as C's.
Nice summary. His "example 6", about false cache sharing, bit me once when I was allocating data structures for my threads and they ended up close enough in memory that the threads shared cache lines. It took a while before I figured it out, and probably the only reason I did was that it didn't happen all the time, so I was wondering why the code sometimes ran slower. If it had been consistent I would just have thought the code was slow. While there are profilers that spit out cache hit rates, it's usually at the scale of whole threads, which makes it very difficult to find which code is bad. You also have to have enough experience to know what a "normal" hit rate is, which I'm not sure I have.
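That trap is easy to reproduce in a sketch like this (the line size, iteration count, and thread layout are illustrative). Two per-thread counters that land on one cache line make the cores ping-pong that line between them; padding each counter out to its own line makes the slowdown disappear:

```c
#include <pthread.h>

#define ITERS 1000000L
#define LINE  64 /* common cache-line size; verify for your CPU */

/* With the pad, each counter sits on its own cache line.  Remove the
 * pad and the two counters typically share one line: each core's
 * write invalidates the other core's copy, slowing both loops down
 * even though the threads never touch the same variable. */
struct padded {
    volatile long v;
    char pad[LINE - sizeof(long)];
};

static struct padded counter[2];

static void *bump(void *arg) {
    struct padded *c = arg;
    for (long i = 0; i < ITERS; i++)
        c->v++;
    return NULL;
}

/* Returns 1 if both counters reached ITERS (they always should:
 * each thread owns its own counter, so there is no data race). */
int run(void) {
    pthread_t t[2];
    counter[0].v = counter[1].v = 0;
    pthread_create(&t[0], NULL, bump, &counter[0]);
    pthread_create(&t[1], NULL, bump, &counter[1]);
    pthread_join(t[0], NULL);
    pthread_join(t[1], NULL);
    return counter[0].v == ITERS && counter[1].v == ITERS;
}
```

Timing `run()` with and without the pad shows the effect directly; the results are the same either way, which is what makes the bug so hard to spot with a correctness-oriented mindset.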
Yes, but be careful with this if your algorithms knowledge comes from a traditional algorithms course. These are often taught as if computational complexity were the only concern; in practice, an algorithm's space usage and memory access pattern are often just as much of an issue.
So while the choice of algorithm is a large issue, one must be careful to select an algorithm that makes the appropriate space vs. computation tradeoff. That choice is often guided by an awareness of where the system's bottlenecks are.
Once you've chosen the correct algorithm and you're down to optimizing code, which is what this article discusses, understanding cache effects is critical, and this article provides a nice view of the basics. The usual rules about not optimizing unless one needs to apply, but I don't see why that always merits an algorithms junkie popping out of the woodwork to say the same thing everyone always says: premature optimization is dumb.
But choice of algorithm can often mean cache-awareness, too - using a random access algorithm where a sequential one could be used instead can be a 10x performance drain, and a 10x constant multiplier is pretty substantial.
For a concrete example, a lot of the most important optimizations in physics simulations involve satisfying the simultaneous goals of using cache-friendly and parallel algorithms. In some cases people are even willing to accept worse big O behavior because the "worse" algorithms use a lot of sequential access and parallelize easier than the sophisticated ones, more than making up the difference for the object counts that come up in practice.
For example, the algorithmic purist may insist on using an O(1) or at least O(log n) container for maps or sets, where in practice, if the container will never hold more than a handful of items, a linear search through an array may be faster - because of cache effects.
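A sketch of that point (the size threshold is illustrative): for a handful of keys, a contiguous array fits in one or two cache lines, so the O(n) scan wins on constant factors.

```c
/* Linear probe through a small contiguous array.  For, say, n <= 16
 * the whole array fits in one or two cache lines, so this often
 * beats a hash table or balanced tree despite the worse asymptotics:
 * no hashing, no pointer chasing, and the prefetcher's best case. */
int find(const int *keys, int n, int key) {
    for (int i = 0; i < n; i++)
        if (keys[i] == key)
            return i;
    return -1;
}
```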
Similarly, an overemphasis on pure OO design can result in lots of fine-grained objects stored on the heap, with lots of edges that need to be traversed to accomplish productive work. If one isn't aware of how expensive following pointers can be owing to cache effects, you can create designs that lose substantial chunks of performance in such a way that it's not obvious, with e.g. a simple code-based profiler, where the bottleneck is.
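The contrast can be sketched as follows (the node layout is illustrative). Both functions compute the same total, but the list version can't issue the next load until the current one completes, while the array version streams through addresses the hardware prefetcher can predict:

```c
#include <stddef.h>

struct node {
    int value;
    struct node *next;
};

/* Each iteration's load address depends on the previous load, so if
 * nodes are scattered across the heap the CPU stalls for close to a
 * full memory latency per node. */
long sum_list(const struct node *n) {
    long s = 0;
    for (; n; n = n->next)
        s += n->value;
    return s;
}

/* Addresses are computable in advance; the prefetcher streams the
 * data in and the loop runs at close to memory bandwidth. */
long sum_array(const int *a, size_t n) {
    long s = 0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}
```

A code-based profiler attributes the cost to an innocuous-looking `n = n->next`, which is why this class of slowdown is so easy to miss.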
You're explaining that poorly, I think. As written, that's only true if 99% of all software is written using poor algorithm choices. Maybe that's true in some regimes, but there is still a bunch of performance-sensitive software written in C/C++ by developers who make good algorithm and data structure choices.
And so when these people get to looking at performance details, they're looking at constant factor improvements. And as described in the article and elsewhere, the constant factors due to cache effects are often on the order of, say, 6 or 10. The penalty to poor cache design can be huge, and it's important to know how they work to avoid that.
[1] http://people.redhat.com/drepper/cpumemory.pdf