dormando | 3 years ago

Now a more philosoraptor-style comment: I see Mcrib is a service built to quickly detect and replace memcached instances. I treat memcached in infrastructure as a very stable service, meaning it rarely needs to be upgraded and it will generally not fail on its own. If it does, the failures will be highly infrequent compared to services with higher churn or more complexity/dependencies. So if instances are failing often enough that you need to rapidly detect and replace them, you have a more fundamental problem.

From a structural standpoint I think my technical comment can be useful. If things really are failing this much: A) you should figure out why and slow that down; B) if you have a generally stable system and understand the typical rate of failure, you can add tripwires to Mcrib that avoid over-culling services and instead loudly raise alarms; and C) you can improve technical reliability with redundancy/extstore/etc.
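
To make the tripwire idea in B) concrete, here's a rough sketch of what I mean (the thresholds and names are made up, not anything Mcrib actually does): cap how much of the pool the control plane may auto-evict in a window, and past that point refuse and page a human instead.

    import time

    EVICTION_WINDOW_SECS = 600     # look at the last 10 minutes
    MAX_EVICTED_FRACTION = 0.05    # never auto-evict more than 5% of the pool

    class EvictionTripwire:
        def __init__(self, pool_size):
            self.pool_size = pool_size
            self.evictions = []    # timestamps of recent auto-evictions

        def allow_eviction(self):
            now = time.time()
            # forget evictions that have aged out of the window
            self.evictions = [t for t in self.evictions
                              if now - t < EVICTION_WINDOW_SECS]
            if (len(self.evictions) + 1) / self.pool_size > MAX_EVICTED_FRACTION:
                return False       # tripwire fired: stop culling, raise a loud alarm
            self.evictions.append(now)
            return True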

I've also seen plenty of cases where folks let a dependency of a service determine whether that service is usable, which I disagree with quite strongly. Consul being down on a node should, at most, prompt a check of whether the service is actually dead, not the conclusion that it is. This matters both for reliability (don't kill perfectly working things; otherwise you end up having to design around that behavior) and for maintainability, since you've now made people afraid of upgrading Consul or other co-dependent services. A similar failure mode is single-point-of-testing availability checking, where you probably want two points of truth before shooting a service.
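
As a bare-bones sketch of what I mean by "two points of truth" (my illustration, not how any particular control plane works): the dependency's verdict and a direct check of the service itself both have to fail before a node gets shot.

    def should_evict(consul_says_down, direct_check_ok):
        # Require two independent signals before declaring a node dead.
        if not consul_says_down:
            return False   # nothing claims the node is unhealthy
        if direct_check_ok:
            return False   # Consul is unhappy but memcached itself answers;
                           # likely an agent problem, not a dead cache node
        return True        # both points of truth agree the service is gone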

Now you risk people being afraid to upgrade just about anything, which means they will work around it, abstract it, or needlessly replace it with something they feel safer managing. The latter is at best a waste of time, at worst a time bomb waiting until you find out what conditions the new thing breaks under.

This isn't to say you should design as though nothing can fail anywhere at any time; I'm just pointing out that how often a service _should_ fail is extremely useful information when designing systems, fail-safes, alerts, monitoring, etc.

ransom1538|3 years ago

"I treat memcached in infrastructure as a very stable service."

I run memcached at a large scale. You are totally right. Every other year we will find ONE bad memcached node down. We use nutcracker instead of mcrouter for consistent hashing to each memcached node. Once I read "We also run a control plane for the cache tier, called Mcrib. Mcrib’s role is to generate up-to-date Mcrouter configurations" -- I was like oooooh boy, here we go....

Knowing memcache is a rock comes with experience though.

iamcal|3 years ago

Our underlying hardware (AWS) is nothing like this reliable. We see regular (several times a year) failure of racks of machines or whole DCs.

Across the whole fleet (all services), we lose 1-10 servers per day as a baseline. Major events are then on top of that and can impact thousands of hosts at once.

muglug|3 years ago

> I run memcached at a large scale

I don't believe you run it at the scale Slack does.

The people at Slack who decided to use Mcrouter (and created Mcrib) have experience running Memcached, Mcrouter and Nutcracker in production at two of the biggest web properties in the world.

Trust that they know whereof they speak.

tuetuopay|3 years ago

I think you nailed the real issue that caused the incident: saying "consul down == unhealthy memcached" and then evicting the node. If Mcrib instead did some actual application-level health checks (e.g. a memcached ping), correlated with some system metrics (CPU, RAM), it could avoid evicting perfectly good nodes with a warm cache that just happen to have a restarting Consul agent.
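
For illustration, here's roughly what I mean by a memcached ping (just a sketch; the port and timeout are assumptions): connect and send the text protocol's "version" command, which a healthy memcached answers with a "VERSION ..." line.

    import socket

    def memcached_alive(host, port=11211, timeout=1.0):
        # Application-level check: does memcached itself answer the protocol?
        try:
            with socket.create_connection((host, port), timeout=timeout) as sock:
                sock.sendall(b"version\r\n")
                return sock.recv(64).startswith(b"VERSION")
        except OSError:
            return False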

Granted, this is easy to say after the incident has happened and an excellent postmortem has been written, but it should be an industry-wide wake-up call: don't do this.

I have the same issue at work, where people treat "Prometheus node_exporter is down" as "the app on the machine is down". I've started to add the actual app name to our alerts, and now people don't freak out anymore when they see "down" alerts: oh, node_exporter is down but not the app? Don't panic; calmly check why.
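
Even something as simple as probing the two separately and naming them in the result makes the distinction obvious; a rough sketch (9100 is node_exporter's default port, the app port is hypothetical):

    import socket

    def port_open(host, port, timeout=1.0):
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            return False

    def node_status(host):
        # Separate checks, separate names: an alert can now say which one is down.
        return {
            "node_exporter": port_open(host, 9100),  # node_exporter default port
            "app": port_open(host, 8080),            # hypothetical app port
        }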

bognition|3 years ago

It’s likely that the memcached installation is so large that the underlying instances themselves fail regularly. When you have hundreds or thousands of instances, instance failures become pretty routine.

_jal|3 years ago

I don't see this. I have thousands of long-lived instances - full VMs, not containers - running on our own hardware.

If they start "going bad", something is wrong. That's a signal I wouldn't want to ignore.

It has happened: once an HBA in a storage node was causing occasional corruption; another time, due to a communication failure, people were building things with the wrong version of something that had a memory leak and would eventually summon the OOM killer. There have been other issues.

"Have you tried turning it off and back on again" is still a terrible system management strategy.

dormando|3 years ago

I can say with certainty this isn't strictly true. The failures should be relatively rare; when I say relatively, I mean on the level of natural node failure. If natural node failure isn't survivable without special systems to quickly replace downed nodes, you don't actually have an N+1 redundancy system. Thus, the pools aren't large enough :) Or, in this case, if nodes really are failing this much, then having them always lose their cache is a major reliability hole.

It's a subtle difference. I think many operators get used to node failures being extremely common when they don't necessarily have to be. I suspect the note about "if they come back on their own, ensure they're flushed" means they have something unusual causing ephemeral failures. If that's just "cloud networking" there isn't much they can do, but otherwise it's almost always fixable.
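
To put a rough number on the N+1 point above (my framing and arithmetic, not anything from the post): with N nodes behind consistent hashing, losing one node costs you roughly 1/N of the cache, so the pool needs to be big enough that one natural failure is tolerable without any rapid replacement.

    def min_pool_size(max_tolerable_cache_loss=0.05):
        # With consistent hashing, losing 1 of N nodes loses ~1/N of the keyspace.
        # Choose N so a single natural node failure stays within tolerance.
        n = 1
        while 1.0 / n > max_tolerable_cache_loss:
            n += 1
        return n

    # e.g. min_pool_size(0.05) == 20: a 20-node pool can lose one node
    # and drop only ~5% of its cached keys.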

sandGorgon|3 years ago

More likely, they are using "spot instances" for memcached, which will cause nodes to be evicted fairly frequently.

prescriptivist|3 years ago

Or horizontal autoscaling based on demand.