dormando | 3 years ago
From a structural standpoint I think my technical comment can be useful. If things really are failing this much A) you should figure out why and slow that down. B) if you have a generally stable system and understand the typical rate of failure, you can add tripwires into Mcrib to avoid over-culling services and loudly raise alarms. Then C) you can improve technical reliability with redundancy/extstore/etc.
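The tripwire idea in (B) can be sketched roughly like this. This is an illustrative sketch, not Mcrib's actual logic; the function name and the `max_eviction_fraction` threshold are assumptions:

```python
def plan_evictions(pool, failed, max_eviction_fraction=0.1):
    """Return the nodes safe to keep; raise if too many failed at once.

    pool: list of all cache nodes; failed: set of nodes failing checks.
    """
    if len(failed) > max_eviction_fraction * len(pool):
        # Tripwire: a mass failure is more likely a monitoring or
        # network problem than many simultaneous node deaths. Keep the
        # existing config and page a human instead of over-culling.
        raise RuntimeError(
            f"refusing to evict {len(failed)}/{len(pool)} nodes; "
            "failure rate exceeds tripwire threshold")
    return [n for n in pool if n not in failed]
```

The threshold encodes the "understand the typical rate of failure" point: once you know the baseline, anything far above it is treated as suspect rather than acted on automatically.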
I've also seen plenty of times where folks let a dependency of a service determine whether that service is usable, which I disagree with quite strongly. Consul being down on a node shouldn't, on its own, lead anything to conclude the service is dead. It's important both for reliability (don't kill perfectly working things, because you end up having to design around it), and for maintainability, as you've now made people afraid of upgrading Consul or other co-dependent services. A similar failure mode is single-point-of-testing availability checking, where instead you probably want two points of truth before shooting a service.
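The "two points of truth" idea can be sketched as a quorum over independent checks. This is a hypothetical illustration; the function and its callables aren't any particular tool's API:

```python
def should_kill(service, checks, quorum=2):
    """Decide whether to evict a service from the pool.

    checks: independent callables (different vantage points, different
    mechanisms) that return True when the service looks dead.
    """
    failures = sum(1 for check in checks if check(service))
    # A single failing vantage point (e.g. the local Consul agent being
    # down) is not enough evidence on its own; require agreement.
    return failures >= quorum
```

The key design point is that the checks must be genuinely independent: two probes routed through the same agent or the same network path still count as one point of truth.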
Now you risk people being afraid of upgrading probably anything, which means they will work around it, abstract it, or needlessly replace it with something they feel safer managing. The latter is at best a waste of time, at worst a time bomb until you find out what conditions this new thing breaks under.
This isn't advocating that you design without assuming anything can fail anywhere at any time; just pointing out that how often a service _should_ fail is extremely useful information when designing systems and designing fail safes, alerts, monitoring, etc.
ransom1538 | 3 years ago
I run memcached at a large scale. You are totally right. Every other year we will find ONE bad memcached node down. We use nutcracker instead of mcrouter for consistent hashing to each memcached node. Once I read "We also run a control plane for the cache tier, called Mcrib. Mcrib’s role is to generate up-to-date Mcrouter configurations" -- I was like oooooh boy, here we go....
Knowing memcache is a rock comes with experience though.
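For context on what "consistent hashing to each memcached node" buys you: adding or removing one node only remaps the keys that hashed to it, not the whole keyspace. Below is a minimal generic ketama-style ring, a sketch rather than nutcracker's or mcrouter's exact algorithm:

```python
import bisect
import hashlib

class Ring:
    """Consistent-hash ring mapping cache keys to nodes."""

    def __init__(self, nodes, vnodes=100):
        # Each node gets many virtual points on the ring so load
        # spreads evenly even with few physical nodes.
        self._points = sorted(
            (self._hash(f"{node}:{i}"), node)
            for node in nodes for i in range(vnodes))
        self._keys = [p for p, _ in self._points]

    @staticmethod
    def _hash(s):
        return int(hashlib.md5(s.encode()).hexdigest(), 16)

    def node_for(self, key):
        # Walk clockwise to the first virtual point at or after the
        # key's hash, wrapping around the ring at the end.
        i = bisect.bisect(self._keys, self._hash(key)) % len(self._keys)
        return self._points[i][1]
```

Removing one node from the ring leaves every other key's mapping unchanged, which is exactly why losing "ONE bad memcached node" every other year is a non-event for the rest of the cache.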
iamcal | 3 years ago
Across the whole fleet (all services), we lose 1-10 servers per day as a baseline. Major events are then on top of that and can impact thousands of hosts at once.
muglug | 3 years ago
I don't believe you run it at the scale Slack does.
The people at Slack who decided to use Mcrouter (and created Mcrib) have experience running Memcached, Mcrouter and Nutcracker in production at two of the biggest web properties in the world.
Trust that they know whereof they speak.
tuetuopay | 3 years ago
Granted, this is easy to say once the incident happened with an excellent postmortem, but this should be an industry-wide wakeup call: don't do this.
I have the same issue at work, where people treat a "prometheus node_exporter down" as "the app on the machine is down". I've started adding the actual app name to our alerts, and now people don't freak out anymore when they see "down" alerts: oh, node_exporter is down, but not the app? Don't panic, and calmly check why.
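Concretely, the fix is to alert on the `up` metric per job, so the alert itself says what actually failed. The rules below are an illustrative sketch; the job names and thresholds are assumptions, not the poster's real config:

```yaml
# Illustrative Prometheus alerting rules: keep "exporter down" and
# "app down" as distinct alerts so nobody conflates the two.
groups:
  - name: availability
    rules:
      - alert: NodeExporterDown
        expr: up{job="node_exporter"} == 0
        for: 5m
        annotations:
          summary: "node_exporter down on {{ $labels.instance }} (the app may be fine)"
      - alert: AppDown
        expr: up{job="myapp"} == 0
        for: 2m
        annotations:
          summary: "myapp itself is down on {{ $labels.instance }}"
```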
_jal | 3 years ago
If they start "going bad", something is wrong. That's a signal I wouldn't want to ignore.
It has happened - once an HBA in a storage node was causing occasional corruption; another time, due to a communication failure, people were building things with the wrong version of something that had a memory leak and would eventually summon the OOM killer. There have been other issues.
"Have you tried turning it off and back on again" is still a terrible system management strategy.
dormando | 3 years ago
It's a subtle difference. I think many operators get used to node failures being extremely common when they don't necessarily have to be. I suspect the note on "if they come back on their own ensure they're flushed" means they have something unusual causing ephemeral failures. If that's just "cloud networking" there isn't much they can do, but otherwise it's almost always fixable.
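The "ensure they're flushed" point is about stale data: a node that dropped out of the pool missed writes while it was gone, so it must be wiped before rejoining. A minimal sketch of that bookkeeping, with the pool-membership model assumed for illustration:

```python
import time

class NodeState:
    """Tracks whether a cache node left the pool and must be flushed."""

    def __init__(self):
        self.left_pool_at = None  # None while the node is in the pool

    def mark_failed(self):
        self.left_pool_at = time.monotonic()

    def must_flush_before_rejoin(self):
        # Any absence at all means writes and invalidations may have
        # missed this node, so its cached values can be stale: flush
        # unconditionally, however brief the outage was.
        return self.left_pool_at is not None
```

In practice the flush itself would be the memcached `flush_all` command (or an equivalent wipe) issued before the node is added back to the routing config.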
prescriptivist | 3 years ago