> On January 4th, one of our Transit Gateways became overloaded. The TGWs are managed by AWS and are intended to scale transparently to us. However, Slack’s annual traffic pattern is a little unusual: Traffic is lower over the holidays, as everyone disconnects from work (good job on the work-life balance, Slack users!). On the first Monday back, client caches are cold and clients pull down more data than usual on their first connection to Slack. We go from our quietest time of the whole year to one of our biggest days quite literally overnight.
What's interesting is that when this happened, some HN comments suggested it was the return from holiday traffic that caused it. Others said, "nah, don't you think they know how to handle that by now?"
Turns out Occam's razor applied here. The simplest answer was the correct one: return-from-holiday traffic.
But Slack has been around for longer than a year, right? Shouldn't they have noticed this happening earlier?
I mean, considering Slack is mostly used as a workplace chat tool, they should have faced this kind of scenario before and have a solution in place by now.
- Disable autoscaling if appropriate during an outage. For example, if the web server is degraded, it's probably best to make sure that the backends don't autoscale down.
- Panic mode in Envoy is amazing!
- Ability to quickly scale your services is important, but that metric should also take into account how quickly the underlying infrastructure can scale. Your pods could spin up in 15 seconds but k8s nodes will not!
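For anyone unfamiliar with Envoy's panic mode: once the fraction of healthy hosts drops below a panic threshold (50% by default, configurable), Envoy assumes the health checks themselves may be wrong and balances across all hosts instead of hammering the few that still look healthy. A rough Python sketch of the idea; the threshold, host shape, and function name are illustrative, not Envoy's actual internals:

```python
def pick_load_balancing_pool(hosts, panic_threshold=0.5):
    """Mimic Envoy-style "panic routing": if too few hosts look healthy,
    assume the health checks are lying and send traffic to every host."""
    if not hosts:
        return []
    healthy = [h for h in hosts if h["healthy"]]
    if len(healthy) / len(hosts) < panic_threshold:
        return hosts      # panic: ignore health status entirely
    return healthy

# The first bullet (don't let degraded signals shrink the fleet) maps to
# suspending only the ASG "Terminate" process during an incident, so the
# group can still scale out but never in -- boto3 sketch, not run here:
#   autoscaling.suspend_processes(AutoScalingGroupName="web-asg",
#                                 ScalingProcesses=["Terminate"])
```

With one of three hosts healthy (33% < 50%), the function returns the full pool; with two of three healthy, it returns only the healthy ones.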
The thing that always worries me about cloud systems is the hidden dependencies in your cloud provider that work until they don't. They typically don't output logs and metrics, so you have no choice but to pray that someone looks at your support ticket and clicks their internal system's "fix it for this customer" button.
I'll also say that I'm interested in ubiquitous mTLS so that you don't have to isolate teams with VPCs and opaque proxies. I don't think we have widely-available technology yet that eliminates the need for what Slack seems to have here, but trusting the network has always seemed like a bad idea to me, and this shows how a workaround can go wrong. (Of course, to avoid issues like the confused deputy problem, which Slack suffered from, you need some service that issues certs to applications as they scale up; certs that will be accepted by the services each application is allowed to talk to and rejected by all other services. In that case, this postmortem would have said "we scaled up our web frontends, but the service that issues them certificates to talk to the backend exploded in a big ball of fire, so we were down." Ya just can't win ;))
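For the curious, the server half of that scheme is mechanically simple; the hard part is the cert-issuing service. A minimal Python sketch, assuming an internal CA bundle (the function name and paths are mine, not Slack's):

```python
import ssl

def make_mtls_server_context(ca_bundle_path=None):
    """Server-side TLS context that refuses any client that can't present
    a certificate signed by our internal CA. Identity comes from the cert,
    not from which network the packet happened to arrive on."""
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    ctx.verify_mode = ssl.CERT_REQUIRED            # no client cert, no connection
    if ca_bundle_path:
        ctx.load_verify_locations(ca_bundle_path)  # trust only the internal CA
    # In production the server also loads its own identity, e.g.:
    # ctx.load_cert_chain("/etc/certs/server.pem", "/etc/certs/server.key")
    return ctx
```

Solving the confused-deputy part takes one more step beyond this: checking the client cert's identity (for instance a SPIFFE-style URI SAN) against a per-service allow-list.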
Experienced something similar with MongoDB Atlas today. Our primary node went down and the cluster didn't fail over to either of the secondaries. We got to sit with our production environment completely offline while staring at two completely functional nodes that we had no ability to use. Even when we managed to get hold of support, they seemed unable to trigger a failover and basically told us to wait for the primary node to come back up. It took 90 minutes in the end and has definitely made us rethink the future and how much control we've given up.
Some customers have a hard requirement that their slack instances be behind a unique VPC. Other customers are easier to sell to if you sprinkle some “you’ll get your own closed network” on top of the offer, if security is something they’ve been burned by in the past.
I agree with you that mTLS is the future. It exists within many companies internally (as a VPC alternative!) and works great. There are some problems around the certificate issuer being a central point of failure, but these are known problems with well-understood solutions.
I think there’s mostly a non-technical barrier to be overcome here, where the non-technical executives need to understand that closed network != better security. mTLS’s time in the sun will only come when the aforementioned sales pitch is less effective (or even counterproductive!) for Enterprise Inc., I think.
So many fails due to in-band control and monitoring are laid bare, followed by this absolute chestnut -
> We’ve also set ourselves a reminder (a Slack reminder, of course) to request a preemptive upscaling of our TGWs at the end of the next holiday season.
Probably the only way to see a problem is if you have a flat line for bandwidth, but as the article suggested they had packet drops, which do not appear in the CloudWatch metrics. AWS should add those metrics, imo.
Didn't we just read a story about the exact same issue?
Traffic picked up heavily on some website or app, AWS didn't auto-scale fast enough or at all and the very systems that are designed to be elastic just tumbled down to a grinding halt?
Update: it was Advent of Code 2020, where the author reported the exact same issue. The AWS autoscaling framework fell flat on its face when the site exploded on release day.
Why don't they mention what seems like a clear lesson: control traffic has to be prioritized using IP DSCP bits, or else your control systems can't recover from widespread frame drop events. Does AWS TGW not support DSCP?
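For reference, DSCP is the top six bits of the old IP TOS byte, and CS6 (decimal 48) is the class selector traditionally used for network control traffic. Marking it from an application looks roughly like this on Linux; whether any given cloud fabric actually honors the mark is exactly the open question above:

```python
import socket

DSCP_CS6 = 48            # class selector 6: "network control"
TOS_CS6 = DSCP_CS6 << 2  # DSCP sits above the two ECN bits -> 0xC0

def make_control_plane_socket():
    """UDP socket whose outgoing packets carry DSCP CS6, so DSCP-aware
    switches can prioritize them during congestion (Linux-specific)."""
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, TOS_CS6)
    return s
```

The point of the mark is that when a link saturates, queue management can drop best-effort data frames first and let the control plane keep talking.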
(I also wrote one of the main Google Fiber monitoring systems back when I was at Google. We spent quite a bit of time on monitoring monitoring, because whenever there was an actual incident people would ask us "is this real, or just the monitoring system being down?" Previous monitoring systems were flaky so people were kind of conditioned to ignore the improved system -- so we had to have a lot of dashboards to show them that there was really an ongoing issue.)
The "I've just done my Solutions Architect exam" answer would be that TGW simplifies the topology by having a central hub, rather than each VPC having to peer with all the other VPCs.
I wonder how many VPCs people have before transitioning over to TGW.
Automated scaling has been a persistent problem for me, especially if I try to scale on simple metrics, or even worse (in Slack's case) on metrics that can compete with each other. The situations in which multiple metrics compete are sometimes difficult to anticipate, but they will always happen eventually if you aren't doing something more sophisticated than "up if metric > value, down if < value" for multiple metrics. I think you've got to combine these into a single custom metric and scale on that alone. I'm totally unsurprised to see that autoscaling failed for both Slack and AWS in this case.
I think you really have to look at metric-based autoscaling and say: is it worth the X% savings per month? Or would I rather avoid the occasional severe headaches caused by autoscaling messing up my day? Obviously this depends on company scale and how much your load varies. I'd rather have an excess of capacity than any impact on users.
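One way to make competing metrics stop fighting is to collapse them into a single score before any threshold comparison, for example by scaling on the most-stressed dimension. A toy sketch; the thresholds and the max() combination are illustrative, not anyone's production policy:

```python
def scaling_decision(cpu_util, queue_fill, up=0.8, down=0.3):
    """Combine several normalized (0..1) signals into one score, then
    compare that single score to the thresholds. Using max() means we
    only scale down when *every* dimension is quiet, so one hot metric
    can't fight a cold one."""
    score = max(cpu_util, queue_fill)
    if score > up:
        return "scale_up"
    if score < down:
        return "scale_down"
    return "hold"
```

With separate per-metric rules, CPU at 0.2 could vote "down" at the same moment queue fill at 0.9 votes "up"; combined, the same inputs yield a single unambiguous "scale_up".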
The big takeaway for me here is that this “provisioning service” had enough internal dependencies that they couldn’t bring up new nodes. Seems like the worst thing possible during a big traffic spike.
> We run a service, aptly named ‘provision-service’, which does exactly what it says on the tin. It is responsible for configuring and testing new instances, and performing various infrastructural housekeeping tasks. Provision-service needs to talk to other internal Slack systems and to some AWS APIs.
The "configuring and testing new instances" part also sounds very fishy to me. Configuration should be done when creating the image and launch template, while testing should be the job of the load balancing layer. Why do we need a separate "provision-service" to piece everything together?
I wonder if you can pre-warm TGWs like you can ELBs? It would be annoying to have to ask AWS to pre-warm a bunch of your stuff, but it's better than it going down.
Our emergency backup for Slack is Zoom. Horrible UX for group chats, but everyone already has it installed and it's quick and simple to set up a new room for each team. For temporary use you can put up with a fair bit of annoying behaviour or lack of features.
tldr: "we added complexity into our system to make it safer and that complexity blew us up. Also: we cannot scale up our system without everything working well, so we couldn't fix our stuff. Also: we were flying blind, probably because of the failing complexity that was supposed to protect us."
I am really not impressed... with the state of IT. I could not have done better, but isn't it too bad that we've built these towers of sand that keep knocking each other over?
The thing is anyone can build a system that can scale out to Slack's level given enough machines and money. What's harder is scaling out to that level and not burning gobs of cash.
It's similar to the whole "buildings a long time ago lasted much longer than buildings today" argument. It's true in the literal sense, but it ignores the fact that we've gotten much better at reducing the cost of stuff like skyscrapers and bridges.
In our pursuit of efficiency, we do things like JIT delivery, dropshipping, scaling, building to the minimum spec. Sometimes, we get it wrong and it comes tumbling down (covid, HN hug of death, earthquakes).
polote|5 years ago
"My bet is that this incident is caused by a big release after a post-holiday "code freeze". "
hnlmorg|5 years ago
floatingatoll|5 years ago
thunderbong|5 years ago
I mean, considering Slack is mostly used as a workplace chat mechanism, they should have faced this kind of a scenario previously and had a solution for this by now.
kparaju|5 years ago
- Disable autoscaling if appropriate during outage. For example if the web server is degraded, it's probably best to make sure that the backends don't autoscale down.
- Panic mode in Envoy is amazing!
- Ability to quickly scale your services is important, but that metric should also take into account how quickly the underlying infrastructure can scale. Your pods could spin up in 15 seconds but k8s nodes will not!
jrockway|5 years ago
I'll also say that I'm interested in ubiquitous mTLS so that you don't have to isolate teams with VPCs and opaque proxies. I don't think we have widely-available technology around yet that eliminates the need for what Slack seems to have here, but trusting the network has always seemed like a bad idea to me, and this shows how a workaround can go wrong. (Of course, to avoid issues like the confused deputy problem, which Slack suffered from, you need some service to issue certs to applications as they scale up that will be accepted by services that it is allowed to talk to and rejected by all other services. In that case, this postmortem would have said "we scaled up our web frontends, but the service that issues them certificates to talk to the backend exploded in a big ball of fire, so we were down." Ya just can't win ;)
bengale|5 years ago
gen220|5 years ago
I agree with you the mTLS is the future. It exists within many companies internally (as a VPC alternative!) and works great. There’s some problems around the certificate issuer being a central point of failure, but these are known problems with well-understood solutions.
I think there’s mostly a non-technical barrier to be overcome here, where the non-technical executives need to understand that closed network != better security. mTLS’s time in the sun will only come when the aforementioned sales pitch is less effective (or even counterproductive!) for Enterprise Inc., I think.
danw1979|5 years ago
> We’ve also set ourselves a reminder (a Slack reminder, of course) to request a preemptive upscaling of our TGWs at the end of the next holiday season.
Thaxll|5 years ago
Probably the only way to see a problem is if you have a flat line for bandwidth, but as the article suggested they had packet drop wich does not appear on the cloudwatch metrics, aws should add those metrics imo
miyuru|5 years ago
keyle|5 years ago
Traffic picked up heavily on some website or app, AWS didn't auto-scale fast enough or at all and the very systems that are designed to be elastic just tumbled down to a grinding halt?
keyle|5 years ago
jeffbee|5 years ago
bovermyer|5 years ago
ignoramous|5 years ago
fullstop|5 years ago
dijit|5 years ago
gscho|5 years ago
Sounds like the monitoring system needs a monitoring system.
jrockway|5 years ago
For Prometheus users, I wrote alertmanager-status to let a third-party "website up?" monitoring server check your alertmanager: https://github.com/jrockway/alertmanager-status
(I also wrote one of the main Google Fiber monitoring systems back when I was at Google. We spent quite a bit of time on monitoring monitoring, because whenever there was an actual incident people would ask us "is this real, or just the monitoring system being down?" Previous monitoring systems were flaky so people were kind of conditioned to ignore the improved system -- so we had to have a lot of dashboards to show them that there was really an ongoing issue.)
TonyTrapp|5 years ago
sargun|5 years ago
tikkabhuna|5 years ago
I wonder how many VPCs people have before transitioning over to TGW.
miyuru|5 years ago
https://slack.engineering/building-the-next-evolution-of-clo...
mbyio|5 years ago
grumple|5 years ago
I think you really have to look at metric-based autoscaling and say: is it worth the X% savings per month? Or would I rather avoid the occasional severe headaches caused by autoscaling messing up my day? Obviously this depends on company scale and how much your load varies. I'd rather have an excess of capacity than any impact on users.
tobobo|5 years ago
throwdbaaway|5 years ago
The "configuring and testing new instances" part also sounds very fishy to me. Configuration should be done when creating the image and launch template, while testing should be the job of the load balancing layer. Why do we need a separate "provision-service" to piece everything together?
nickthemagicman|5 years ago
jjtheblunt|5 years ago
conradfr|5 years ago
ianrw|5 years ago
plaidfuji|5 years ago
allannienhuis|5 years ago
nhoughto|5 years ago
johnnymonster|5 years ago
jeffrallen|5 years ago
I am really not impressed... with the state of IT. I could not have done better, but isn't it too bad that we've built these towers of sand that keep knocking each other over?
bobthebuilders|5 years ago
It's similar to the whole buildings a long time ago last much longer than those today. Its true in the literal sense, but it ignores the fact that we've gotten at reducing the cost of stuff like skyscrapers and bridges.
In our pursuit of efficiency, we do things like JIT delivery, dropshipping, scaling, building to the minimum spec. Sometimes, we get it wrong and it comes tumbling down (covid, HN hug of death, earthquakes).