Whenever I see a product this large having an outage (a lot recently, especially with Facebook, GCP, and AWS), I can only think of how stressful it must be for whoever needs to fix it. I've been on the other side of the outage before, albeit with a much smaller product, but lord was that stressful. Thinking about the random engineers that are stressed out and thinking that they could be fired for this, even if the cause wasn't their fault (since management at large companies can be pretty thick) is very upsetting to me. Say what you want about a large company having a large outage, but it's normal engineers that are trying to fix it at the end of the day, and I can sympathize.
I can also sympathize with the engineers trying to fix this, but I hope that they wouldn't be thinking that they could be fired for this. Successful teams that I have worked on, even at big companies with high-usage products, have always promoted a culture of "systems break - let's improve the system, not blame a person". Any 'mistake' by an employee is actually a sign of a problem with the system. Any resilient system should account for human errors - those always happen. I wouldn't want to work for a team or company that would consider firing somebody for causing an outage rather than addressing the root cause.
The cool thing about Slack (and Discord for that matter) is that it's essentially federated on a technical level. SlackHQ controls all the instances, but the UX is very achievable by an open source alternative. I'm bullish on federated chat in the next few years.
It would be awesome to have essentially the same experience as using Slack, but also a server running on our lab's hardware that we can reboot/debug ourselves.
They've got to nail the integrations. Anytime moving to Slack alternatives is brought up with people I work with, it always comes down to everything integrating with Slack.
One fun thing we discovered an outage or two ago: Zoom, which is likely already on your computer too, has a very Slack-like Chat feature. Open up the Zoom app, then click the Chat icon in the top header. You probably missed it this entire time. You can make rooms, DMs, animated gifs, the works :)
It's strange, but Zoom as chat was actually how I first used it. A whole year prior to the pandemic, the place I worked (fairly large) purchased Zoom as their internal chat tool.
That spyware thankfully isn't on mine, and if you're forced to use it (if you for whatever reason can't use Jitsi/Teams) then I strongly encourage you not to install the app and instead use the web version.
Please don't encourage the use of Zoom, they already have enough market share they don't deserve after lying through their teeth to the public on multiple occasions.
Things like this remind me how dependent I am on IM for doing any work, which is weird because it doesn't feel like that should be the case. If you were to measure the raw number of seconds I spend sending IMs, it would be a relatively low percentage of my day, but I always forget how vital that percentage actually is.
As it stands, while I'm not fully blocked from being productive, I am blocked for a lot of work that needs to get done.
Yep. Especially if your development flow has integrated Slack alerts.
Coupled with COVID work-from-home constraints, there's no practical way right now to get code reviewed for merge into the main codebase 'round these parts.
We switched from Slack to Teams for various reasons about two years ago. I'm still mad about it. Teams blows for chat. The video meetings are fine though.
The MS Teams UX is pretty painful, especially for infrequent, communication-heavy users. Not painful enough to outright prevent work, but clearly more painful than the repercussions of missing a few hours of work.
This seems at best a funny tweet. They're both just tools and frankly terrible in terms of "getting work done". This seems like a great time to go heads down on a deep task and if you really need to contact someone send an email or even phone them.
Surprised no one is talking about how awesome Slack is. Think about it in user stories:
* I want to find someone by name => ctrl+K
* I want to search with sane keywords => search, "from:me" "to:<channel name>" work.
* I want to remember someone I talked to recently => people you chat with show up in the list on the left
* I want to keep com channels organized as the ground changes => easily rename channels, favorites UI works well, channel grouping works right.
* I want to give a public emotional response to someone's statement => reactions
* I want to continue discussion of a point someone made, which may not be relevant to everyone in the channel => threads
* I want to edit a typo I just made => hit up on keyboard
It's got that quality Factorio has, where you can let yourself imagine it is the ideal product, and start expecting features you need to be there, rather than not bothering to explore because you think they won't be.
It is AMAZINGLY good at solving actual user stories around communication. I have lots of respect for their PMs.
Sorry, but dropping Markdown support in favor of an input UX that is to this day atrocious gives me little faith in Slack's user stories moving forward. I'm fine if they want to make the default mode more intuitive for a broader audience, but I have yet to hear a good argument for not having a little button to enable raw Markdown mode.
EDIT: See comments below for instructions on re-enabling Markdown mode.
* I used to be able to type a message containing "@channel" or "@here" and just hit Enter without looking, and 100% of the time it would be parsed correctly. Now it fails to detect the "@channel" or "@here" often enough that I have to look to see if it failed, then go back and retype that part of the message until it lights up. Once a week or so, I see a message posted by someone else that contains "@channel" in black, and they expected the message to notify everyone but it didn't.
* Recently Slack pushed everyone to switch from usernames to full names with spaces, and eliminated the entire concept of unique usernames. There were many annoying consequences to this, one of which is that when I type "@name", now I have to look to see if it highlighted, or look to interact with the drop-down menu, instead of just typing and knowing it will work.
* Searching also got worse for the same reason. When I type "@name" in the search bar, it NEVER lights up. For 90% of searches this means I can no longer type in a search query and just press Enter. I always have to look through the drop-down menu and either mouse-click or press down-arrow repeatedly to get to the thing I want.
> A configuration change inadvertently led to a sudden increase in activity on our database infrastructure. Due to this increased activity, the affected databases failed to serve incoming requests to connect to Slack. We introduced tighter rate limits on connection requests to reduce the load on the system. This meant that some people could not access Slack at all, but also that Slack would continue working for those who were already connected.
> Once the system had stabilized, we began lifting these rate limits to enable more connections to Slack. However, we moved too quickly and the increased activity affected the system again. We reinstated the rate limits and redirected some traffic to the database replicas to relieve the demand on our primary databases.
I wonder if their bottleneck is at vitess vttablet or at mysql.
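For anyone curious what "tighter rate limits on connection requests" and lifting them in stages might look like, here's a minimal token-bucket sketch in Python. To be clear, this is not Slack's implementation: the class name, `ramp_schedule`, and every number here are made up for illustration, and the clock is passed in by hand so the behavior is easy to reason about.

```python
class ConnectionRateLimiter:
    """Hypothetical token bucket that admits new connection requests."""

    def __init__(self, rate_per_sec, burst, now=0.0):
        self.rate_per_sec = rate_per_sec  # tokens added per second
        self.burst = burst                # maximum tokens the bucket can hold
        self.tokens = float(burst)        # start with a full bucket
        self.last = now                   # timestamp of the last refill

    def _refill(self, now):
        # Add tokens for the elapsed time, capped at the burst size.
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate_per_sec)
        self.last = now

    def allow(self, now):
        # Admit one connection request if a token is available.
        self._refill(now)
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

    def set_rate(self, rate_per_sec, now):
        # Settle tokens at the old rate before switching to the new one.
        self._refill(now)
        self.rate_per_sec = rate_per_sec


def ramp_schedule(start_rate, target_rate, steps):
    """Rates for lifting a limit in equal steps instead of all at once."""
    step = (target_rate - start_rate) / steps
    return [start_rate + step * i for i in range(1, steps + 1)]


# Assumed numbers: a tight limit of 2 connections/sec with a burst of 2.
limiter = ConnectionRateLimiter(rate_per_sec=2, burst=2, now=0.0)
decisions = [limiter.allow(0.0), limiter.allow(0.0), limiter.allow(0.0)]
later = limiter.allow(0.5)          # half a second later, one token is back
rates = ramp_schedule(10, 50, 4)    # lift from 10/s to 50/s in four steps
```

The "we moved too quickly" part of the postmortem corresponds to jumping straight to the target rate; a stepped schedule like `ramp_schedule` gives you a chance to watch database load between steps and call `set_rate` back down if it climbs again.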
I run an IRC network in my spare time that has had 13 minutes of actual downtime in the last 15 years (and about 20-25 additional minutes of degraded availability).
The degraded availability thing is because I run multiple servers for latency. Normally an IRC outage is super clear, except with netsplits, where IRC servers themselves unlink from each other. (I average about 3 of those a year and they last roughly 30s each.)
It's not video though, so that doesn't answer your question thoroughly.
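To put those numbers in "nines" terms, here's a quick back-of-the-envelope calculation in Python. The assumptions are mine: a 365.25-day year, and 25 minutes as the upper end of the "20-25 additional minutes" of degraded availability.

```python
MINUTES_PER_YEAR = 365.25 * 24 * 60  # assumes an average 365.25-day year


def availability(downtime_minutes, years):
    """Fraction of the period during which the service was up."""
    total_minutes = years * MINUTES_PER_YEAR
    return 1.0 - downtime_minutes / total_minutes


# 13 minutes of hard downtime over 15 years.
hard = availability(13, 15)

# Counting degraded time as downtime too (assumed upper bound of 25 minutes).
with_degraded = availability(13 + 25, 15)

print(f"hard downtime only: {hard:.5%}")
print(f"including degraded: {with_degraded:.5%}")
```

Either way it comes out better than "five nines" (99.999%), even when the degraded minutes are counted as full downtime.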
At first, I thought it was just my internet connection making fun of me. Then, I asked my co-workers if they experienced the same thing, and they did. It's a bummer that it temporarily made our communications quite hard and unstable.
Slack was down for me, and now that it is back, it is begging me to subscribe to use paid features, with the nag screens blocking my view of important stuff.
I guess it was down so they could do this crappy update.
Most of my company has a spotty connection. Down for Windows and Linux alike. Interestingly (and several have commented this already), the mobile app seems to be somewhat working.
[+] [-] jjice|4 years ago|reply
[+] [-] pqseags|4 years ago|reply
[+] [-] rajbot|4 years ago|reply
Response for this incident went by the book, as described in Brent's talk above. Incident Management programs like these ensure that incidents can be resolved while also minimizing stress and chaos for engineers and other responders.
PagerDuty has good Incident Responder and Incident Commander training courses, if you are interested in setting up a program similar to Slack's:
- https://response.pagerduty.com/training/courses/incident_res...
- https://response.pagerduty.com/training/incident_commander/
Fun fact: Brent Chapman is also known for creating the `majordomo` mailing list manager from the early 90s
[+] [-] lima|4 years ago|reply
[+] [-] aantix|4 years ago|reply
Talk about being on center stage..
[+] [-] anderspitman|4 years ago|reply
[+] [-] bengale|4 years ago|reply
[+] [-] aftbit|4 years ago|reply
[+] [-] whalesalad|4 years ago|reply
[+] [-] feanaro|4 years ago|reply
[+] [-] nate|4 years ago|reply
[+] [-] KingMachiavelli|4 years ago|reply
[+] [-] Cerium|4 years ago|reply
[+] [-] abletonlive|4 years ago|reply
no thanks. I would never install zoom on any of my computers voluntarily.
[+] [-] legofr|4 years ago|reply
[+] [-] rythmshifter03|4 years ago|reply
[+] [-] tombert|4 years ago|reply
[+] [-] shadowgovt|4 years ago|reply
[+] [-] twistedpair|4 years ago|reply
[+] [-] gavnewalkar|4 years ago|reply
[+] [-] gadrev|4 years ago|reply
[+] [-] jillesvangurp|4 years ago|reply
They are responding on twitter that there's something up: https://twitter.com/SlackHQ/status/1496133311558737923
[+] [-] iso1631|4 years ago|reply
[+] [-] danhab99|4 years ago|reply
[+] [-] onion2k|4 years ago|reply
[+] [-] duderific|4 years ago|reply
[+] [-] chefandy|4 years ago|reply
[+] [-] skeeter2020|4 years ago|reply
[+] [-] epivosism|4 years ago|reply
[+] [-] anderspitman|4 years ago|reply
[+] [-] zestyping|4 years ago|reply
[+] [-] jordache|4 years ago|reply
WTH? Why not?
[+] [-] throwdbaaway|4 years ago|reply
[+] [-] Ansil849|4 years ago|reply
[+] [-] typon|4 years ago|reply
[+] [-] dijit|4 years ago|reply
[+] [-] haroman|4 years ago|reply
[+] [-] irthomasthomas|4 years ago|reply
It's pretty confusing.
[+] [-] speeder|4 years ago|reply
[+] [-] seanw444|4 years ago|reply
[+] [-] SteveNuts|4 years ago|reply
I'm putting my money on an expired certificate in certain client app versions.
[+] [-] pempem|4 years ago|reply