
Cache made consistent: Meta’s cache invalidation solution

280 points | uvdn7 | 3 years ago | engineering.fb.com

193 comments

[+] rgbrenner|3 years ago|reply
I see a couple of monitoring/reporting systems... but no caching solution. These are tools to catch bugs in whatever solution you're using. Good work, but not a solution for cache invalidation.

And regarding those tools:

It doesn't sound like Polaris would handle a network partition well. If a cache invalidation triggers it to check the other caches for consistency, that assumes Polaris will receive the invalidation message. Imagine a scenario with 5 cache servers and a Polaris server. On one side of the split are Polaris and 2 servers; the other side has 3 cache servers. It's possible for the 3 cache servers to receive an update that is never received by the Polaris+2 network. Polaris would not just be unaware of the inconsistency; it also wouldn't know to check for it later, once the network partition is resolved.

I also feel like the consistency tracing assumes that only one fill and invalidate is occurring at a time (when in practice there may be multiple fills and invalidates occurring in parallel on the same data), and that those calls will arrive in order. If they arrive out of order, it doesn't sound like it would catch that. I think you're relying on Polaris to catch this case, but high latency cannot be differentiated from a network partition except by the length of the delay, so these two types of errors can occur together — in which case you'd have a cache error that neither tool would detect.

I would like to hear why I'm wrong.

I understand this is being used in prod, but network partitions don't occur every day, and I'm not convinced this has seen enough flaky networks to work out the bugs.

[+] uvdn7|3 years ago|reply
> but not a solution for cache invalidation

Yeah, I have learned that people have different perceptions of what cache invalidation means (I have a narrower definition, a subset of what some people think of as cache invalidation). I will not die on this hill. I am happy to rename/rephrase/anything to be helpful.

> it doesnt sound like Polaris would handle a network partition well ...

Good question, and you are right. My thought on this is that over time, Polaris would still catch those unique issues that happen during a network partition or whatnot. Polaris doesn't promise 100% coverage all the time. However, over time, if there are flaws in the system, Polaris should help surface them. And once it does (even with just one inconsistent sample), it should be very actionable: we can find out why it happened and fix the underlying cause.

> I also feel like the consistency tracing is assuming that only one fill and invalidate is occurring at a time (when in practice, there may be multiple fills and invalidates occurring in parallel on the same data)... and that those calls will arrive in order. If they arrive out of order, it doesnt sound like it would catch that..

It doesn't make that assumption, actually. Distributed systems are state machines. Consistency tracing essentially logs state transitions (it doesn't do much detection). So when Polaris detects an anomaly, we have all the information to help us diagnose it. And you are right that fills and invalidations can happen in parallel, and that's fine. E.g. if a fill happens before an invalidation, we can always track the state mutations caused by the invalidation. If the invalidation is the culprit (the earlier fill can't be, because it happened earlier), we would have a log for it.
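To make the "log transitions, detect elsewhere" idea concrete, here is a minimal sketch of what such a trace could look like. All names are illustrative; this is not TAO's or Polaris's actual API.

```python
import time
from dataclasses import dataclass, field

@dataclass
class TraceEntry:
    key: str
    op: str        # "fill" or "invalidate"
    version: int
    ts: float

@dataclass
class ConsistencyTrace:
    entries: list = field(default_factory=list)

    def log(self, key, op, version):
        # Tracing only records state transitions; anomaly *detection*
        # is left to a separate checker (Polaris, in the post's terms).
        self.entries.append(TraceEntry(key, op, version, time.time()))

    def history(self, key):
        # Once an anomaly is flagged, replay every transition for the key
        # to find which mutation left the cache inconsistent.
        return [e for e in self.entries if e.key == key]
```

Because every fill and invalidation is logged, parallel or out-of-order operations don't break diagnosis: the full per-key history is available after the fact.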

[+] uvdn7|3 years ago|reply
> Good work, but not a solution for cache invalidation.

Assuming your definition of cache invalidation is about "when/who" to invalidate on writes, let's actually try solving it.

E.g. in its most generic form, a cache can store arbitrary materializations from any data source. Now, when updating the data source, in order to keep caches consistent, you essentially need to transact (a cross-system transaction) on both the data source and the cache(s). Usually a cache has a larger number of replicas, and I am not sure running this type of transaction is practical at scale.

What happens if we don't transact on both systems (the data source and the cache)? Well, now whenever the asynchronous update pipeline performs the computation, it's done against a moving data source (not a snapshot of when the write was committed). Now let's say the data source is Spanner, which provides point-in-time snapshots. On Spanner commit, you get a commit time (TrueTime) back. Using that commit time, reading the data and computing the cache update can be done asynchronously.

Because the materializations in cache don't take writes themselves, updates to them are essentially performed blindly and can be ordered by commit time (TrueTime).
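A minimal sketch of what "blind updates ordered by commit time" could mean in practice — a last-writer-wins merge keyed on the data source's commit timestamp (names and shapes are illustrative, not Spanner's API):

```python
def apply_update(cache, key, value, commit_ts):
    # Blind, idempotent write: keep whichever materialization carries
    # the newest commit timestamp, regardless of arrival order.
    current = cache.get(key)
    if current is None or commit_ts > current[1]:
        cache[key] = (value, commit_ts)

cache = {}
apply_update(cache, "q1", "old", commit_ts=100)
apply_update(cache, "q1", "new", commit_ts=200)
apply_update(cache, "q1", "late-but-stale", commit_ts=150)  # dropped
```

Since the cache entry never takes writes of its own, a late-arriving update with an older commit time can simply be discarded.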

Now, this does assume that whatever we cache (e.g. the query) needs to be schematized and made known to the invalidation pipeline. I think that's a very fair assumption to make. Otherwise (anyone can cache anything without the invalidation pipeline knowing at all), it's pretty obvious that this problem can't be solved.

[+] DeathArrow|3 years ago|reply
>I understand this is being used in prod, but network partitions don't occur everyday...

Can't we assume they tested it before deploying to production by introducing all the weird stuff that might happen in a distributed system, like partitions and high latency, so that they aren't relying only on "it seems to work in production"?

[+] benlivengood|3 years ago|reply
Do you guarantee referential integrity in TAO? Last I heard the answer is no; complex queries are not supported and clients would have to do their own work with internal versions to achieve a consistent view on the whole graph (if such a consistent view even exists). But it seems to work fine since global referential integrity doesn't seem to be a big deal; there aren't a lot of cases where dependency chains are long enough or actions quick enough that it matters (e.g. a group admin adds a new user, makes them an admin, who adds an admin, and then everyone tries to remove the admin who added them [always locally appearing as if there is >=1 remaining admin], but causes the group to ultimately have no admins when transactions settle). Run into any fun issues like that?

Contrasting that with Spanner where cache coherency is solved by reading from the recent past at consistent snapshots or by issuing global transactions that must complete without conflict within a brief truetime window. I am guessing the cost of Spanner underneath the social graph would be a bit too much for whatever benefits might be gained, but curious if anyone looked into using something similar.

[+] uvdn7|3 years ago|reply
For the fun issues you described, read-modify-write (either using optimistic concurrency control or pessimistic) can work, if I understand your question correctly.

Spanner is awesome and one of my favorite systems/papers. I think it would be very expensive in compute and power to run the social graph workload on a Spanner-like system. Do you have an estimate of how many megawatts (if not more) would be needed to support one quadrillion queries a day on Spanner?

[+] DeathArrow|3 years ago|reply
>Do you guarantee referential integrity in TAO?

Since I observed data inconsistencies on Facebook multiple times, I believe they aren't as concerned about data integrity as a company doing financial transactions would be.

[+] judofyr|3 years ago|reply
> At a high level, Polaris interacts with a stateful service as a client and assumes no knowledge of the service internals. This allows it to be generic. We have dozens of Polaris integrations at Meta. “Cache should eventually be consistent with the database” is a typical client-observable invariant that Polaris monitors, especially in the presence of asynchronous cache invalidation. In this case, Polaris pretends to be a cache server and receives cache invalidation events.

I'm a bit confused here. Polaris behaves as a regular cache server which receives invalidation events, but isn't the typical cache-invalidation bug that a service forgets to contact the cache server for invalidation? So this will only catch cases where (1) you remember to contact Polaris but forgot to contact other cache servers [which Polaris happens to know about], or (2) you're not handling errors during invalidation requests to the cache server [and the request to Polaris was successful]? Or are your cache servers "smart", with internal logic that might "ignore" an invalidation?

What am I missing?

EDIT: Reading through "A real bug we found and fixed this year", and I'm still a bit confused. It seems like a very contrived bug tied directly to how you deal with versioning (e.g. you allow the latest version to be present with stale metadata, or something?). My main concern with cache invalidation is what to invalidate at what time.

[+] elliotswart|3 years ago|reply
I did an analysis of the bug from the blog post and the TAO paper. From what I can tell, the fundamental bug is that the error handler for cache invalidation errors drops only lesser versions, while the cache invalidation contract requires replacing equal or lesser versions.

There is a detailed write up here: https://elliotswart.github.io/pragmaticformalmodeling/cachin...
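The off-by-one in the comparison can be shown in a few lines. This is a hypothetical reconstruction of the contract described above, not TAO's actual code:

```python
def handle_failed_invalidation(cache, key, v, strict):
    # Contract: on invalidate(key, v), drop any cached entry at
    # version <= v. The buggy handler used strict '<' instead.
    entry = cache.get(key)
    if entry is not None:
        stale = entry["version"] < v if strict else entry["version"] <= v
        if stale:
            del cache[key]

buggy = {"k": {"version": 4, "value": "stale metadata"}}
fixed = {"k": {"version": 4, "value": "stale metadata"}}
handle_failed_invalidation(buggy, "k", 4, strict=True)   # entry survives
handle_failed_invalidation(fixed, "k", 4, strict=False)  # entry dropped
```

With strict comparison, an entry at exactly version v (latest version, stale metadata) survives the invalidation — which matches the inconsistency described in the blog post.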

[+] uvdn7|3 years ago|reply
> isn't the typical bug related to cache invalidation that a service forgets to contact the cache server for invalidation?

That's not the case, based on our experience at FB. At-least-once delivery is basically a solved problem.

But you are absolutely right that if there's an issue in the invalidation delivery, it's possible that Polaris won't receive the event either. Polaris actually supports a separate event stream (all the way from client-initiated writes) to cover this case.

[+] uvdn7|3 years ago|reply
> EDIT: Reading through "A real bug we found and fixed this year" and I'm still a bit confused.

Yeah, that specific bug is convoluted.

> It seems like a very contrived bug directed directly to how you deal with versioning (e.g. you allow the latest version to be present with stale metadata, or something?).

This is mostly for performance reasons that I won't go into in detail here. Version is mostly an internal concept that clients don't care about, so it's OK-ish.

> My main concern with cache invalidation is what to invalidate at what time.

Can you elaborate? Our experience with cache invalidation is that you invalidate on mutation; in terms of "what to invalidate", it depends on the computation, so you might have to give me a little more detail.

[+] uvdn7|3 years ago|reply
I am the author of the blog post. I believe the methodology described should be applicable to most if not all invalidation-based caches. I am serious when I say that cache invalidation might no longer be a hard thing in computer science. AMA!
[+] shrimpx|3 years ago|reply
My problem has never been invalidation as stated in the article, but how to manage interdependent cache entries, so that when one is invalidated, the dependent ones must also be invalidated. For example get_user(123) and get_users_for_org(456). Suppose user with id 123 is part of the org with id 456. When the user is deleted, you have to invalidate the get_users_for_org(456) entry. I haven’t seen any convincing “design pattern” for managing such dependencies.
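One sketch of such a pattern — a cache that records which derived entries depend on which base entries, so invalidating a base key cascades. The class and key names are illustrative, not a claim about any particular library:

```python
from collections import defaultdict

class DependentCache:
    def __init__(self):
        self.data = {}
        self.dependents = defaultdict(set)  # base key -> derived keys

    def put(self, key, value, depends_on=()):
        self.data[key] = value
        for base in depends_on:
            self.dependents[base].add(key)

    def invalidate(self, key):
        # Dropping a base entry cascades through everything derived from it.
        self.data.pop(key, None)
        for derived in self.dependents.pop(key, set()):
            self.invalidate(derived)

cache = DependentCache()
cache.put("get_user(123)", {"name": "Ada", "org": 456})
cache.put("get_users_for_org(456)", [123], depends_on=["get_user(123)"])
cache.invalidate("get_user(123)")  # also drops get_users_for_org(456)
```

The hard part, as noted in the replies below, is that for complex queries the dependency set itself is hard to compute, which is where the simple pattern stops scaling.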
[+] uvdn7|3 years ago|reply
Yep. I respect this definition, and I tried to clarify things a bit here https://news.ycombinator.com/item?id=31676102.

I am sorry for the confusion. We have internal abstractions that deal with the problem you described (especially in simpler forms; the example you have is fairly manageable, actually).

EDIT: I can see why it gets complicated if your data model and query is very complicated.

[+] MaxMoney|3 years ago|reply
In the past I've used a timestamp key for the org. Create a key like f"org_{org.pk}" whose value is a timestamp, then append that timestamp to the `get_users_for_org` key. Now all you have to do is invalidate the org timestamp to generate a rolling key (requires more memory, since old entries linger until evicted).
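A rough sketch of this rolling-key pattern, with a counter standing in for the timestamp and a hypothetical `load_users_from_db` loader:

```python
import itertools

_clock = itertools.count(1)  # stand-in for a timestamp source
cache = {}

def load_users_from_db(org_id):
    return [f"user-of-org-{org_id}"]  # hypothetical DB query

def get_users_for_org(org_id):
    # Fold the org's current "timestamp" into the derived key, so rolling
    # it silently abandons every older users_for_org_* entry.
    ts = cache.setdefault(f"org_{org_id}", next(_clock))
    derived = f"users_for_org_{org_id}@{ts}"
    if derived not in cache:
        cache[derived] = load_users_from_db(org_id)
    return cache[derived]

def invalidate_org(org_id):
    # Rolling the key *is* the invalidation; stale entries stay around
    # until evicted, which is the extra-memory trade-off noted above.
    cache[f"org_{org_id}"] = next(_clock)
```

No derived entry is ever deleted explicitly; the old keys simply become unreachable.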
[+] pyrolistical|3 years ago|reply
I don’t get it. If you already have the version in the cache, then when a client does GET X, the cache can reply X@v4.

As long as the client submits transactions with “I calculated this based off of X@v4” then the database can reject if there is a newer version of X.

The client can then replay the transactions against the cache and by then there will be a X@v5.

With this scheme you can track the transaction replay rate instead of having to build a new system to read-back the cache for consistency.

To get the traceability on which cache node is stale, the client can forward a header from the cache node that identifies it. Then using the same transaction rejection rate ops has visibility on which cache node isn’t being updated.

No cache invalidation needed. Always just cache forever X@version, it’s just that the cache allows an unstable query for GET X.
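The scheme above can be sketched as version-checked commits at the database (a generic optimistic-concurrency sketch; the class and method names are made up for illustration):

```python
class VersionedStore:
    def __init__(self):
        self.versions = {}  # key -> committed version number

    def commit(self, key, based_on):
        # Reject the transaction if the client computed against a version
        # that is no longer current; the client then refetches from cache
        # (which by then should hold the newer version) and replays.
        current = self.versions.get(key, 0)
        if based_on != current:
            return False
        self.versions[key] = current + 1
        return True

db = VersionedStore()
db.commit("X", based_on=0)  # accepted: X is now @v1
db.commit("X", based_on=0)  # stale read: rejected, caller must replay
db.commit("X", based_on=1)  # replayed against X@v1: accepted, X@v2
```

The rejection rate then becomes the staleness metric the comment proposes, rather than a separate read-back system.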

[+] dragontamer|3 years ago|reply
Versions of cache-data aren't numbers, they're vectors-of-numbers (at a minimum).

Here's a bunch of versions of variable "X":

X@Alice-V1, X@Alice-V2, X@Bob-V1, X@Alice-V2/Bob-V2, X@Carol-V1

Which version of "X" should you read?

EDIT: Alice, Bob, and Carol are three different servers. What has happened here is that Alice and Bob are closer together, so their updates are faster between each other (synchronizing the writes). Carol is slower for some reason (bad connection?), and is going to update off of old data.

In this case, the two valid caches are X@Alice-V2/Bob-V2, and X@Carol-V1. The other cached data is old and can be discarded.

Things get complicated when you have not only reads (such as in your simplified question), but also writes that are cached (such as what happens in practice).
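The comparison the parent describes is the standard vector-clock partial order; a small sketch (with the servers from the example above as clock entries):

```python
def compare(a, b):
    """Compare two vector clocks (dicts mapping server -> counter).
    Returns 'before', 'after', 'equal', or 'concurrent'."""
    keys = set(a) | set(b)
    a_le_b = all(a.get(k, 0) <= b.get(k, 0) for k in keys)
    b_le_a = all(b.get(k, 0) <= a.get(k, 0) for k in keys)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"
    if b_le_a:
        return "after"
    return "concurrent"

alice_v1 = {"Alice": 1}
alice2_bob2 = {"Alice": 2, "Bob": 2}
carol_v1 = {"Carol": 1}
compare(alice_v1, alice2_bob2)   # "before": safe to discard the older one
compare(alice2_bob2, carol_v1)   # "concurrent": neither supersedes the other
```

This is why both X@Alice-V2/Bob-V2 and X@Carol-V1 remain valid: neither clock dominates the other, so some external rule (or a merge) has to pick.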

[+] jitl|3 years ago|reply
Tracking all data dependencies used for all writes in a large system seems rather challenging.

Plus what about read workloads, like “pull message from queue and send it as a push notification to user Y”? I guess it’s fine if the push is re-delivered due to stale cache?

[+] AtNightWeCode|3 years ago|reply
So basically, classic timestamp versioning with some consistency checking. Might work. Cache invalidation is hard. The only problem I ever faced while working in the biz that I could not solve even at theory level was cache related.

What I thought the article would tackle is the dependencies between stored objects. Some solve it with super complicated graphs others by just flushing the entire cache.

At Meta, why not just use flat files or a store that act as one, pubsub the changes, listen on the changes and update aggregated views as needed and store them. Then just short-term cache everything on the edge.

[+] uvdn7|3 years ago|reply
> What I thought the article would tackle is the dependencies between stored objects.

Like tracking dependencies, so it knows what to invalidate on mutations?

[+] londons_explore|3 years ago|reply
> In other words, 99.99999999 percent of cache writes are consistent within five minutes.

Those are the kind of rates that lead to very hard-to-find bugs. You won't write code in downstream systems to properly handle that failure case, and suddenly someone will get paged at 3am because it has just happened, years after the code was deployed, and nobody really understands how this corner case suddenly occurred!

When I'm building systems, every happy codepath should either happen frequently or never.

Having a cache whose design allows it to be stale but only very rarely seems like a bad idea. If I was building something on top of this cache I would have a layer on top to force staleness for 1 in 1000 requests just to force the application built on top to be designed to handle that properly.
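The forced-staleness layer the comment describes could be sketched as a thin wrapper over the cache read path (names and the 1-in-1000 rate are just the comment's hypothetical):

```python
import random

def with_forced_staleness(cache_get, stale_get, rate=0.001):
    # Wrap a cache read so roughly 1 in 1000 requests deliberately
    # returns stale data, so callers must be built to tolerate it.
    def get(key):
        if random.random() < rate:
            return stale_get(key)
        return cache_get(key)
    return get
```

This is the same philosophy as fault injection generally: make the rare failure mode common enough that untested handling paths can't survive in production unnoticed.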

[+] uvdn7|3 years ago|reply
I agree with everything you said.

> Having a cache whose design allows it to be stale but only very rarely seems like a bad idea.

And that's exactly why we first brought this number from six 9's to more than ten 9's, and we are not done yet. Also, the cache is not "designed" to be inconsistent. It's like Paxos: it is designed to solve the consensus problem, but when implemented, most Paxos implementations do not work properly.

[+] Hnrobert42|3 years ago|reply
You can’t even visit the FB blog while using NordVPN. Amazing.
[+] mbj111|3 years ago|reply
Tweet thread from veteran engineer - balance between the desired system properties and the costs of coordination (https://twitter.com/MarcJBrooker/status/1534944325997453312)
[+] uvdn7|3 years ago|reply
Marc put it extremely well. I agree with every single word of his thread. I should have applied a narrower and more specific definition of cache invalidation in the blog post. I apologize for any confusion it caused.
[+] irrational|3 years ago|reply
Phil Karlton famously said, “There are only two hard things in computer science: cache invalidation and naming things and off-by-one errors.”
[+] eckesicle|3 years ago|reply
and exactly once delivery and exactly once delivery.
[+] rossmohax|3 years ago|reply
How can both of these be true?

> Data in cache is not durable, which means that sometimes version information that is important for conflict resolution can get evicted.

> We also added a special flag to the query that Polaris sends to the cache server. So, in the reply, Polaris would know whether the target cache server has seen and processed the cache invalidation event.

To make the special flag work, the cache server needs to track not only the current version state but also past versions. If it tracks past versions, then conflicts can be resolved at the cache-server level — but the whole premise of the article is that cache servers can't resolve conflicts by themselves.

[+] uvdn7|3 years ago|reply
Great question!

I will try to answer this one without too much implementation details.

First of all, cache items can be evicted and are not durable; that's just a fact. But that doesn't mean we can't track progress (in terms of "time", or "logical time"/"order" in distributed-systems terms). The social graph is sharded (not surprisingly). We can keep each cache host's progress per shard, which is just a map kept in memory, and that map doesn't get evicted.
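A minimal sketch of how evictable items can coexist with a non-evictable progress map (a hypothetical reconstruction of the idea, not Meta's implementation):

```python
class CacheHost:
    def __init__(self):
        self.items = {}           # key -> (value, version); evictable
        self.shard_progress = {}  # shard -> latest invalidation version applied

    def apply_invalidation(self, shard, key, version):
        self.shard_progress[shard] = max(self.shard_progress.get(shard, 0), version)
        self.items.pop(key, None)

    def evict(self, key):
        self.items.pop(key, None)  # eviction never touches the progress map

    def has_processed(self, shard, version):
        # What the "special flag" effectively asks: has this host seen the
        # invalidation event, even if the item itself was since evicted?
        return self.shard_progress.get(shard, 0) >= version
```

So the host can answer "have you processed invalidation v for this shard?" without keeping per-item version history, which is what resolves the apparent contradiction above.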

I hope this answers your question.

[+] VWWHFSfQ|3 years ago|reply
As naive as it is, I always kinda liked MySQL's query cache invalidation strategy: blow away the whole cache anytime a mutating statement arrives at the server.

Simple. Effective. But it obviously doesn't work for everything.
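For contrast with everything above, the whole strategy fits in a few lines (a generic sketch, not MySQL's actual query cache code):

```python
class FlushAllCache:
    def __init__(self):
        self.entries = {}

    def get(self, key):
        return self.entries.get(key)

    def put(self, key, value):
        self.entries[key] = value

    def on_mutation(self, statement):
        # Any write to the database empties the whole cache: trivially
        # consistent, but the hit rate collapses under a write-heavy load.
        self.entries.clear()

c = FlushAllCache()
c.put("SELECT * FROM users", ["ada"])
c.on_mutation("UPDATE users SET ...")
c.get("SELECT * FROM users")  # None: the entire cache was dropped
```

There is no invalidation logic to get wrong, which is exactly why it was liked — and why it can't scale to write-heavy workloads.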

[+] Trufa|3 years ago|reply
Because who needs scaling anyway!? Joking aside, the strategy seems logical for smaller stuff.
[+] gwbas1c|3 years ago|reply
> Based on consistency tracing, we know the following happened...

I'm a little confused:

- Is the database updated notification going out while the transaction is still in-progress? Doesn't it make more sense to delay the notification until after the transaction commits?

- If a cache tries to update itself while there's an open transaction against the data it's trying to read, shouldn't the database block the operation until the write is complete? And then shouldn't another notification after the transaction commits trigger the cache to update?

[+] uvdn7|3 years ago|reply
> Is the database updated notification going out while the transaction is still in-progress?

No, that's not what happened.

> If a cache tries to update itself while there's an open transaction against the data it's trying to read, shouldn't the database block the operation until the write is complete?

Yes. Generally speaking, having a read transaction here would be better. There are unrelated reasons why it's done this way. The point of the example is that it's really intricate and we can still identify the bug regardless.

[+] pdevr|3 years ago|reply
From the article:

>>Polaris pretends to be a cache server

De facto, it is a cache server, with the associated performance decrease. Isn't the main purpose of caching to improve performance?

>>We deploy Polaris as a separate service so that it will scale independently from the production service and its workload.

Assuming the scaling is horizontal, so then, to synchronize among the service instances, what do you do? Create another meta-Polaris service? Not rhetoric or sarcasm - hoping for an open discussion.

[+] uvdn7|3 years ago|reply
> De facto, it is a cache server, with the associated performance decrease. Isn't the main purpose of caching to improve performance?

Polaris only receives cache invalidation events and doesn't serve any client queries.

> so then, to synchronize among the service instances

It doesn't synchronize among the service instances. It's actually basically stateless: it pretends to be a cache server only to receive invalidation events, acts as a client, and assumes no knowledge of the cache internals.

[+] orf|3 years ago|reply
> Take TAO, for example. It serves more than one quadrillion queries a day

That’s 11,574,074,074 requests per second.

[+] throwdbaaway|3 years ago|reply
That's more queries per second than there are humans.

One request from a human would likely create multiple queries internally.

[+] DeathArrow|3 years ago|reply
>Distributed systems are essentially state machines. If every state transition is performed correctly, we will have a distributed system that works as expected.

I imagine huge distributed systems as Markov chains, where transitions are probabilistic rather than deterministic.

[+] uvdn7|3 years ago|reply
Interesting… what are your thoughts on Lamport’s TLA+?
[+] DeathArrow|3 years ago|reply
So you measure with Polaris and have found you have data consistency of up to 10 nines in a period of 5 minutes. But aren't inconsistencies building up over time?

What if you measure for 1 hour instead of 5 minutes? Still 10 nines?

[+] uvdn7|3 years ago|reply
This is a very good question.

The number of nines actually goes up with increasing timescale windows, because Polaris checks for anomalies at write/invalidation time.

I guess what you are describing is cache inconsistency at read time. E.g. if I introduced an inconsistent cache entry at time T, it can be exposed to many subsequent reads (hence the "building up over time" you mentioned). This is an important metric, and we actually measure it as well.

The key difference between read-time and write-time cache consistency measurement is purpose. Write-time measurement is more actionable, as it captures the moment (or very close to it) when the cache becomes inconsistent; if you want to debug something, you want to get close to when the anomaly happens. Read-time measurement is more about quantifying the negative impact of cache inconsistencies (which are client-facing).

[+] c3534l|3 years ago|reply
I've heard this sort of thing is almost as hard as naming things.
[+] hoseja|3 years ago|reply
Yeah, the nines are very trendy and impressive, but wouldn't it be more comprehensible to say "one in a million" and "one in ten billion"?
[+] FranksTV|3 years ago|reply
The nines are just a way to say 1e-5 without having to use notation.