(no title)
1a1a11a|2 years ago
The first point is probably true for most research in computer systems. Different systems have different constraints that make a simple design very complex or not possible.
Can you elaborate on the second point more? It feels to me that S3-FIFO is better optimized for latency, throughput, and scalability than existing algorithms (e.g., those with multiple LRU queues).
jandrewrogers|2 years ago
You are probably familiar with this problem, but it is trickier than many average software devs allow. Basic CLOCK-derived algorithms are an obvious choice for CPU efficiency, and easy to patch up for most weaknesses, but when cache metadata becomes large, which is common on modern servers, they have a tendency to thrash the CPU cache for some access patterns. This has some pretty negative performance effects. Not having to worry about multithreading (because thread-per-core) enabled some modifications that came about as close as I’ve seen to ensuring that cache replacement lives in hot CPU cache lines. This largely involves ensuring that you can find an immediately evictable buffer with a short scan with very high probability, and if that fails (a rare case), the scan metadata is so compressed that it often easily fits within L1. And that metadata is often evaluated with massive parallelism in a handful of clock cycles using vector intrinsics. AFAIK, these variants are the strongest cache replacement designs in terms of CPU efficiency (not necessarily cache hit rate for some access patterns).
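To make the "compressed scan metadata" idea concrete, here is a hypothetical sketch (not the actual design described above, which has never been published): pack one reference bit per buffer frame into 64-bit words, so the metadata for thousands of frames fits in a few cache lines, and find a clear (evictable) bit with word-at-a-time bit tricks rather than walking frame headers. The function name and layout are my own illustration; it assumes the frame count is a multiple of 64.

```python
def find_evictable(ref_bits, hand, nframes):
    """Scan packed reference bits word-by-word for a clear (evictable) bit.

    ref_bits: list of 64-bit ints, one bit per buffer frame (1 = recently
    referenced). Testing 64 frames per iteration keeps the scan metadata
    for thousands of frames resident in L1; a real kernel might test
    several words at once with vector intrinsics instead of this loop.
    Assumes nframes is a multiple of 64.
    """
    ALL_ONES = 0xFFFFFFFFFFFFFFFF
    words = len(ref_bits)
    start = hand // 64                        # resume near the clock hand
    for i in range(words):
        idx = (start + i) % words
        w = ref_bits[idx]
        if w != ALL_ONES:                     # at least one clear bit here
            inv = ~w & ALL_ONES
            bit = (inv & -inv).bit_length() - 1   # index of lowest clear bit
            return idx * 64 + bit
    return None  # everything referenced; caller clears bits and rescans
```

The common case is the short scan: the first word or two near the hand usually yields a victim, and the full-table fallback is rare precisely because reference bits get cleared as the hand sweeps.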
Some of the core techniques above are things I learned from other database kernel designers who themselves learned them from someone else. The scan length reduction technique was so elegant and obvious in hindsight that I am surprised it never occurred to me or most anyone else (though it does make the designs more subtle). Quite a lot of these practical details have never been written down as far as I can tell; they are passed along as lore and beer convos. I am as guilty of this as anyone else.
I thought the SIEVE algorithm was great, by the way! This area of research deserves far more attention than it seems to receive, most people act like it is a fully solved problem. Caches are in the hot path of everything.
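For readers who haven't seen it, SIEVE is simple enough to sketch in a few lines: a FIFO queue with one "visited" bit per object and an eviction hand that sweeps from tail toward head, clearing visited bits until it finds an unvisited object to evict. Unlike CLOCK, survivors are not re-queued; new objects enter at the head. This is my own minimal sketch (the list-based queue is for clarity, not efficiency):

```python
class SieveCache:
    """Minimal SIEVE: FIFO queue + visited bits + a lazily moving hand."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.queue = []      # index 0 = tail (oldest), end = head (newest)
        self.visited = {}
        self.hand = 0        # eviction hand, starts at the tail

    def _evict(self):
        while True:
            if self.hand >= len(self.queue):
                self.hand = 0                 # wrap back to the tail
            key = self.queue[self.hand]
            if self.visited[key]:
                self.visited[key] = False     # second chance; keep sweeping
                self.hand += 1
            else:
                del self.queue[self.hand]     # evict in place; the hand now
                del self.visited[key]         # points at the next object
                return

    def access(self, key):
        """Returns True on a hit, False on a miss."""
        if key in self.visited:
            self.visited[key] = True          # hit: just set the visited bit
            return True
        if len(self.queue) >= self.capacity:
            self._evict()
        self.queue.append(key)                # miss: insert at the head
        self.visited[key] = False
        return False
```

The hit path touches one bit and moves nothing, which is part of why it is cheap; a production version would use a linked structure instead of list deletion.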
That said, in recent years I have become a lot more interested in the theoretical limits of effective cache replacement, regardless of algorithm. There are modern workloads that are asymptotically approaching randomness for all cache replacement even when the cache is 10-20% of the working set. Being able to narrowly split and assign workload to things that are effectively cacheable and things that are not is becoming important. I routinely deal with workloads now where <1% of the working set fits in memory, and it is becoming more common. Caching is important but the impact on overall performance is diminishing.
1a1a11a|2 years ago
> a lot of these practical details have never been written down as far as I can tell, it is passed along as lore and beer convos
I wish the valuable knowledge could be written down and shared.
Yup, I am on the same page as you. After seeing so many state-of-the-art algorithms converge to similar miss ratios, I am inclined to think we are close to the maximum possible hit ratio permitted by the limited information in the request history (in the information-theoretic sense).