One thing I've seen is that a weighted LRU is usually slightly better for unpredictable workloads. i.e. it will never evict high use cache data when there's a flurry of one-off traffic that would normally start evicting a normal LRU cache, and even a 2 random LRU scheme.
This is particularly relevant for databases and keeping rows in memory.
The algorithm is something like "add two weight when an item is used, and decay one from every item every second, evict items under pressure from the lowest weight to the highest."
Are there any languages that allow you to give the compiler a hint that you're about to grind over a gigantic dataset so don't bother to cache any of this data because it won't be accessed again for a long time? It seems like it could be helpful in keeping a big crunch from obliterating the cache constantly. You might also be able to apply other optimizations, like preloading the next data blocks so they're ready when the CPU rolls around. Maybe compilers already do this behind the scenes?
There are "stream" instructions available on some CPU. These instructions tell the CPU to store data, but not to load it into a cache line (it might actually load it into a temporary cache line that is not part of the normal cache lines, depending on architecture). This is useful is you are writing data to memory but don't read it - there's no point in loading that data to the cache line since you will not access it again.
One such instruction can be used with the _mm_stream_si128 intrinsic:
void _mm_stream_si128(__m128i *p, __m128i a);
Stores the data in a to the address p without polluting the caches. If the cache line containing address p is already in the cache, the cache will be updated. Address p must be 16 byte aligned.
Not the compiler per se, but you can remap memory to mark it as "non-cacheable". This is typically done in drivers for instance, when you access memory-mapped device registers that should most definitely never be cached.
You probably don't want to do that for regular RAM unless you access a lot of data only once and your access pattern is completely unpredictable (which is almost never the case).
Even if you only read the data once having a cache means that the prefetcher can work its magic if it notices a pattern in your accesses, preloading what it thinks you'll read next in the cache ahead of you.
You can also help the prefetcher by explicitly adding preload instructions (if the ISA allows it, at least) if you know ahead of time where you'll read next but you expect the prefetcher won't be able to guess it. That's useful if for instance you're navigating through some sort of linked list data structure, where the access pattern is potentially random looking "from the outside" but you know where you'll go next in your algorithm.
Don't. Yes, there are instructions for this. No, don't use them, unless you really exactly know what you are doing and optimizing towards a specific, single µarch only, otherwise they will invariably hurt performance, not improve it.
Similarly explicit prefetching usually does not improve performance, but reduces it.
(Non-temporal stores are quite a good example here, since a game engine used them in a few spots until recently, causing not only worse performance on Intel chips, but also heavily deteriorated performance on AMD's Zen µarch. Removing them improved performance for all chips across the bank. Ouch!)
I suspect you will find its sorta unnecessary, most modern machines have stream detection built into the caches as well, and simple memcpy's and the like won't actually roll the entire cache. They might roll a set, or simply stop filling after some number of cache lines. This is also why prefetch hints don't tend to help streaming operations.
That said, most modern machines also have stream oriented instructions (sse nontemporal stores for example) that don't allocate on write, but you have to be careful because for workload after workload they have been shown to be slower. There are a couple reasons, but the most important thing to remember is that microbenchmarks repeatedly copying the same piece of data over and over aren't really reflective of most workloads.
There are non-temporal loads and stores that bypass the cache, but last I checked the non-temporal loads don't actually work with cacheable memory types - this may have changed by now.
The cache partitioning in Broadwell can help with this. If you have a very noisy background task that accesses a lot of data (e.g. like garbage collection scanning), you could partition it with a small slice of cache so that it doesn't evict the data the rest of your application uses. I got myself a Broadwell Xeon v4 for my desktop specifically so I could play with this feature.
The short answer is no, because the machine language doesn't allow general manipulation of the cache. So the cache is pretty strictly controlled by the hardware. In general the hardware works very well for a wide variety of different workloads. It can often work better than a compiler can, because it works at runtime, and is thus flexible to changes in workloads. The preloading optimization you mentioned is sort of done already: it's called prefetching. The idea is a cache will automatically cache memory near where you just accessed, because for a lot of memory intensive operations you are linearly scanning a specific section of memory.
This 2 random cache scheme reminds me a lot of the 2 random load balancing scheme laid out in "The Power of Two Choices in Randomized Load Balancing" https://www.eecs.harvard.edu/~michaelm/postscripts/mythesis....
Which shows an exponential improvement in choosing the less loaded of two randomly chosen servers over just random.
The "2 random choices" idea also shows up as a good idea in other settings, such as load balancing for hashing. I wrote a writeup[1], but if you google "power of two random choices" you'll see a lot of literature on it.
Hello, now Redis (4.0, still in release candidate) supports LFU (Least Frequently Used) which should work better in most cases since LRU also is also just a way to approximate LFU, since the most frequently used items are often the last recently used ones.
# LRU and minimal TTL algorithms are not precise algorithms but approximated
# algorithms (in order to save memory), so you can select as well the sample
# size to check. For instance for default Redis will check three keys and
# pick the one that was used less recently, you can change the sample size
# using the following configuration directive.
#
# maxmemory-samples 3
[...] choosing the least recently used (LRU) is an obvious choice, since you’re more likely to use something if you’ve used it recently.
There are many other cache replacement policies [1] and they can outperform LRU especially if the quoted assumption is not true. It is for example quite common to have two types of data, one is frequently reused over very long periods, the other is reused over short periods after the first use. In those cases a more complex policy like ARC or CAR can provide noticeable improvements.
> It is for example quite common to have two types of data, one is frequently reused over very long periods, the other is reused over short periods after the first use.
I can't help but see the parallels with garbage collection algorithms, which seem to actually be the same problem approached from a different angle. Which objects should be scanned for cleanup and how often for optimal use of time maps very closely to what items to keep or event from a cache.
Ideally we could peer into a crystal ball and know which items are going to be accessed in the near future and keep those in the cache at all costs. Since we don't know that, we use a predictor based on past behavior. If past behavior is random, then prediction is for shit. So we assume that behavior has locality: something referenced recently is more likely to be referenced again than something not referenced recently. Based on that guess, we sort things from least recently used to most recently (or approximately sort, using aging tricks to avoid actually sorting), and boot out the least recently used entries.
If accesses are completely random, then the LRU selection is pure overhead. Or worse; some patterns, like repeated sequential accesses of something large through a small cache, are anti-locality. If we are accessing a big buffer in a circle, then the next thing we are accessing at all times is precisely the least-recently used item that LRU wants to boot out.
For situations like that, caching mechanisms sometimes support some API for providing a hint. "These memory locations are going to be needed soon". "This memory mapped file is being sequentially accessed". And such.
Another characteristic of the data which would make random particularly bad is if reloading/recomputing the most popular items took significantly longer than reloading unpopular items. This could be the case for instance if item data size grew in proportion to popularity.
...on real workloads, random tends to do worse than other algorithms. But what if we take two random choices and just use LRU between those two choices?
"2-random" is his short hand for the above scenario.
[edit: formatting, I have no idea how to signify a quote; I don't want preformatted because it makes a scroll box.]
Basically pick-2 allows you to make a mixture of a totally random and optimal choice in cases where you have imperfect information that is much more resilient to the staleness of data than either a random choice or the globally optimum choice.
that's a horrible pattern. As soon as you go 1 bit over your cache size in a hot loop you'll have a 100% miss rate. (assuming each element is loaded once in the hot loop).
There is definitely a case where this applies, which is the case where going over your cache size is an exceptional circumstance that you don't expect to reach very often. In that case, all the bookkeeping for determining what to evict is wasted effort.
Curious - why wouldn't your CPU L1/L2/L3 caches always be "full?" What does it even mean to have a "full" cache in that context? Wouldn't they, under normal operations, typically fill up in a few seconds? (If not milliseconds?)
When the cost of retrieving an item from your slower storage is pretty much the same for every item, may it be old or new, small or big, popular or not.
Ajax made predictive (random)prefetching mainstream .its only natural that random eviction algorithm are coming along . when you pick one end of stick you pick another too
[+] [-] ordinaryperson|9 years ago|reply
Highly recommended.
[+] [-] falcolas|9 years ago|reply
This is particularly relevant for databases and keeping rows in memory.
The algorithm is something like "add two weight when an item is used, and decay one from every item every second, evict items under pressure from the lowest weight to the highest."
[+] [-] rozim|9 years ago|reply
[+] [-] jandrese|9 years ago|reply
[+] [-] kevindqc|9 years ago|reply
One such instruction can be used with the _mm_stream_si128 intrinsic:
void _mm_stream_si128(__m128i *p, __m128i a);
Stores the data in a to the address p without polluting the caches. If the cache line containing address p is already in the cache, the cache will be updated. Address p must be 16 byte aligned.
Learned about it in "What every programmer should know about memory": https://www.akkadia.org/drepper/cpumemory.pdf
It's from 2007, but still relevant and a good read.
[+] [-] simias|9 years ago|reply
You probably don't want to do that for regular RAM unless you access a lot of data only once and your access pattern is completely unpredictable (which is almost never the case).
Even if you only read the data once having a cache means that the prefetcher can work its magic if it notices a pattern in your accesses, preloading what it thinks you'll read next in the cache ahead of you.
You can also help the prefetcher by explicitly adding preload instructions (if the ISA allows it, at least) if you know ahead of time where you'll read next but you expect the prefetcher won't be able to guess it. That's useful if for instance you're navigating through some sort of linked list data structure, where the access pattern is potentially random looking "from the outside" but you know where you'll go next in your algorithm.
[+] [-] dom0|9 years ago|reply
Similarly explicit prefetching usually does not improve performance, but reduces it.
(Non-temporal stores are quite a good example here, since a game engine used them in a few spots until recently, causing not only worse performance on Intel chips, but also heavily deteriorated performance on AMD's Zen µarch. Removing them improved performance for all chips across the bank. Ouch!)
[+] [-] StillBored|9 years ago|reply
That said, most modern machines also have stream oriented instructions (sse nontemporal stores for example) that don't allocate on write, but you have to be careful because for workload after workload they have been shown to be slower. There are a couple reasons, but the most important thing to remember is that microbenchmarks repeatedly copying the same piece of data over and over aren't really reflective of most workloads.
[+] [-] eloff|9 years ago|reply
The cache partitioning in Broadwell can help with this. If you have a very noisy background task that accesses a lot of data (e.g. like garbage collection scanning), you could partition it with a small slice of cache so that it doesn't evict the data the rest of your application uses. I got myself a Broadwell Xeon v4 for my desktop specifically so I could play with this feature.
[+] [-] jnordwick|9 years ago|reply
For memory mapped regions there is madvise() DONT_NEED to tell the kernel to that you won't need this data again so don't bother with the disk cache.
For the CPU cache the non-temporal instructions MOVNTI that tells the CPU to not bother caching the move to prevent cache pollution.
[+] [-] openasocket|9 years ago|reply
[+] [-] jamesfmilne|9 years ago|reply
"A case for the non-temporal store" https://blogs.fau.de/hager/archives/2103
[+] [-] JoshTriplett|9 years ago|reply
[+] [-] unknown|9 years ago|reply
[deleted]
[+] [-] unknown|9 years ago|reply
[deleted]
[+] [-] SomewhatLikely|9 years ago|reply
[+] [-] colonelxc|9 years ago|reply
[+] [-] j2kun|9 years ago|reply
[1]: https://jeremykun.com/2015/12/28/load-balancing-and-the-powe...
[+] [-] krallja|9 years ago|reply
I wonder if it would be worth adding a k-random eviction strategy, for a balance between the two.
[+] [-] antirez|9 years ago|reply
[+] [-] BillinghamJ|9 years ago|reply
Edit: this is clearer - https://redis.io/topics/lru-cache#approximated-lru-algorithm
Excerpt from https://raw.githubusercontent.com/antirez/redis/2.8/redis.co...
[+] [-] danbruc|9 years ago|reply
There are many other cache replacement policies [1] and they can outperform LRU especially if the quoted assumption is not true. It is for example quite common to have two types of data, one is frequently reused over very long periods, the other is reused over short periods after the first use. In those cases a more complex policy like ARC or CAR can provide noticeable improvements.
[1] https://en.wikipedia.org/wiki/Cache_replacement_policies
[+] [-] bnegreve|9 years ago|reply
The article is about CPU caches so speed and memory usage are very critical.
One of the main benefit of LRU is that it only needs one extra bit per cache line.
[+] [-] kbenson|9 years ago|reply
I can't help but see the parallels with garbage collection algorithms, which seem to actually be the same problem approached from a different angle. Which objects should be scanned for cleanup and how often for optimal use of time maps very closely to what items to keep or event from a cache.
[+] [-] kazinator|9 years ago|reply
Ideally we could peer into a crystal ball and know which items are going to be accessed in the near future and keep those in the cache at all costs. Since we don't know that, we use a predictor based on past behavior. If past behavior is random, then prediction is for shit. So we assume that behavior has locality: something referenced recently is more likely to be referenced again than something not referenced recently. Based on that guess, we sort things from least recently used to most recently (or approximately sort, using aging tricks to avoid actually sorting), and boot out the least recently used entries.
If accesses are completely random, then the LRU selection is pure overhead. Or worse; some patterns, like repeated sequential accesses of something large through a small cache, are anti-locality. If we are accessing a big buffer in a circle, then the next thing we are accessing at all times is precisely the least-recently used item that LRU wants to boot out.
For situations like that, caching mechanisms sometimes support some API for providing a hint. "These memory locations are going to be needed soon". "This memory mapped file is being sequentially accessed". And such.
[+] [-] SomewhatLikely|9 years ago|reply
[+] [-] mapgrep|9 years ago|reply
[+] [-] sethammons|9 years ago|reply
...on real workloads, random tends to do worse than other algorithms. But what if we take two random choices and just use LRU between those two choices?
"2-random" is his short hand for the above scenario.
[edit: formatting, I have no idea how to signify a quote; I don't want preformatted because it makes a scroll box.]
[+] [-] mnarayan01|9 years ago|reply
[+] [-] plandis|9 years ago|reply
Basically pick-2 allows you to make a mixture of a totally random and optimal choice in cases where you have imperfect information that is much more resilient to the staleness of data than either a random choice or the globally optimum choice.
[+] [-] PaulHoule|9 years ago|reply
[+] [-] Coding_Cat|9 years ago|reply
[+] [-] kabdib|9 years ago|reply
[+] [-] rspeer|9 years ago|reply
There is definitely a case where this applies, which is the case where going over your cache size is an exceptional circumstance that you don't expect to reach very often. In that case, all the bookkeeping for determining what to evict is wasted effort.
[+] [-] ghshephard|9 years ago|reply
[+] [-] nsebban|9 years ago|reply
[+] [-] pacificleo11|9 years ago|reply