One of the most satisfying feature degradation steps I did with FastComments was making it so that if the DB went offline completely, the app would still function:
1. It auto restarts all workers in the cluster in "maintenance mode".
2. A "maintenance mode" message shows on the homepage.
3. The top 100 pages by comment volume will still render their comment threads, as a job on each edge node recalculates and stores this on disk periodically.
4. Logging in is disabled.
5. All db calls to the driver are stubbed out with mocks to prevent crashes.
6. Comments can still be posted and are added into an on-disk queue on each edge node.
7. When the system is back online the queue is processed (and stuff checked for spam etc like normal).
It's not perfect but it means in a lot of cases I can completely turn off the DB for a few minutes without panic. I haven't had to use it in over a year, though, and the DB doesn't really go down. But useful for upgrades.
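The queue-and-replay part (steps 6–7) can be sketched roughly like this. This is just a minimal Python sketch of the idea, not FastComments' actual code; `QUEUE_PATH`, `db_available`, `save_to_db`, and `check_spam` are all hypothetical stand-ins for whatever the real system uses:

```python
import json
import os

QUEUE_PATH = "comment-queue.jsonl"  # hypothetical on-disk queue on each edge node

def db_available():
    """Stand-in for a real health check against the database."""
    return os.environ.get("DB_UP", "0") == "1"

def post_comment(comment, save_to_db):
    """Write through to the DB normally; append to the on-disk queue in maintenance mode."""
    if db_available():
        save_to_db(comment)
    else:
        with open(QUEUE_PATH, "a") as f:
            f.write(json.dumps(comment) + "\n")

def drain_queue(save_to_db, check_spam):
    """Replay queued comments once the DB is back, running the usual spam checks."""
    if not os.path.exists(QUEUE_PATH):
        return 0
    replayed = 0
    with open(QUEUE_PATH) as f:
        for line in f:
            comment = json.loads(line)
            if check_spam(comment):
                continue  # drop spam, just like the normal posting path would
            save_to_db(comment)
            replayed += 1
    os.remove(QUEUE_PATH)
    return replayed
```

An append-only JSONL file is a nice fit here because appends are cheap and each line is independently recoverable if the node dies mid-write.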
Built it on my couch during a Jurassic Park marathon :P

Joining Google a few years ago, one thing I was impressed with is the amount of effort that goes into graceful degradation. For user facing services it gets quite granular, and is deeply integrated into the stack – from application layer to networking.
Previously I worked on a big web app at a growing startup, and it's probably the sort of thing I'd start adding in small ways from the early days. Being able to turn off unnecessary writes, turn down the rate of more expensive computation, turn down rates of traffic amplification, these would all have been useful levers in some of our outages.
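Those levers can start out very small. A minimal sketch of what "turn off unnecessary writes" and "turn down the rate of expensive computation" might look like, assuming a hypothetical `LEVERS` dict that in a real system would come from a config service or feature-flag store:

```python
import random

# Hypothetical degradation levers, adjustable at runtime without a deploy.
LEVERS = {
    "analytics_writes": False,  # turn off non-essential writes entirely
    "recompute_rate": 0.1,      # do the expensive recomputation for only 10% of requests
}

def handle_request(record_analytics, recompute_scores):
    """Serve a request, applying whatever degradation levers are currently set."""
    if LEVERS["analytics_writes"]:
        record_analytics()
    # Probabilistically shed the expensive work instead of doing it every time.
    if random.random() < LEVERS["recompute_rate"]:
        recompute_scores()
```

The probabilistic rate knob is handy in an outage because you can dial load down smoothly (1.0 → 0.5 → 0.1) instead of flipping a feature fully off.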
It's really great to have such capabilities, but adding them has a cost that only a few can afford: the investment in building them cuts into your feature velocity, and then there's the ongoing maintenance.
Facebook makes over 300 requests for me just loading the main logged in page while showing me exactly 1 timeline item. Hovering my mouse over that item makes another 100 requests or so. Scrolling down loads another item at the cost of over 100 requests again. It's impressive in a perverse way just how inefficient they can be while managing to make it still work, and somewhat disturbing that their ads bring in enough money to make them extremely profitable despite it.
I can't comment on the numbers, but think of how many engineers work there and how many users Facebook, Whatsapp, Instagram have. Each engineer is adding new features and queries every day. You're going to get a lot of queries.
I think about 10 years ago when I was working there I checked the trace to load my own homepage. Just one page, just for myself, and there were 100,000 data fetches.
> Am I reading the second figure right? Facebook can do 130*10^6 queries/second == 130,000,000 queries/second?!
That sounds totally plausible to me.
Also keep in mind they didn't say what system this is. It's often true that 1 request to a frontend system becomes 1 each to 10 different backend services owned by different teams and then 20+ total to some database/storage layer many of them depend on. The qps at the bottom of the stack is in general a lot higher than the qps at the top, though with caching and static file requests and such this isn't a universal truth.
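The fan-out arithmetic is easy to sketch with made-up numbers that match the shape described above (these figures are purely illustrative, not Facebook's actual traffic):

```python
# 1 frontend request fans out to ~10 backend services and 20+ storage queries in total.
frontend_qps = 400_000   # illustrative: requests hitting the frontend tier
backend_fanout = 10      # backend service calls per frontend request
storage_fanout = 20      # storage-layer queries per frontend request, in total

backend_qps = frontend_qps * backend_fanout
storage_qps = frontend_qps * storage_fanout
```

So a few hundred thousand QPS at the top of the stack already puts the bottom in the millions; numbers like 130M/s at some inner layer don't require an implausible amount of user-facing traffic.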
A custom JIT + language + web framework + DB + queues + orchestrator + hardware built to your precise specifications + DCs all over the world go a long way ;)
Off-topic but: I love the font on the website. At first I thought it was the classic Computer Modern font (used in LaTeX). But nope. Upon inspecting the stylesheet, it's https://edwardtufte.github.io/et-book/, a font designed by Dmitry Krasny, Bonnie Scranton, and Edward Tufte, originally for Tufte's book Beautiful Evidence. But people showed interest in the font; see the bulletin board on ET's website: https://www.edwardtufte.com/bboard/q-and-a-fetch-msg?msg_id=... Initially he was reluctant to go to the trouble of releasing it digitally, but eventually he did make it available on GitHub.
They're the same thing or close to it. "Load shedding" might be a bit more general. A couple possible nuances:
* Perhaps "graceful feature degradation" as a choice of words is a way of noting there's immediate user impact (but less than ungracefully running out of capacity). "Load shedding" could also mean something less impactful, for example some cron job that updates some internal dashboard skipping a run.
* "feature degradation" might focus on how this works at the granularity of features, where load shedding might mean something like dropping request hedges / retries, or individual servers saying they're overloaded and the request should go elsewhere.
The situation is that A depends on B, but B is overloaded; if we allow B to do load shedding, we must also write A to gracefully degrade when B is not available.
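A minimal sketch of that pairing, with all names hypothetical: B rejects work outright when it's at capacity (shedding rather than queueing), and A catches the rejection and falls back to possibly-stale cached data instead of failing its own callers:

```python
class Overloaded(Exception):
    """Raised by B when it sheds load instead of accepting more work."""

class ServiceB:
    def __init__(self, capacity):
        self.capacity = capacity
        self.in_flight = 0

    def query(self, key):
        if self.in_flight >= self.capacity:
            raise Overloaded()  # load shedding: reject fast rather than queue forever
        self.in_flight += 1
        try:
            return f"fresh:{key}"
        finally:
            self.in_flight -= 1

class ServiceA:
    def __init__(self, b):
        self.b = b
        self.cache = {}

    def get(self, key):
        try:
            value = self.b.query(key)
            self.cache[key] = value
            return value
        except Overloaded:
            # Graceful degradation: serve stale data, or a placeholder if we have nothing.
            return self.cache.get(key, "unavailable")
```

The key point is that the two halves only work together: if B sheds but A just propagates the error, users still see failures; if A is ready to degrade but B queues instead of shedding, both fall over together.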
Seems like whenever I log into FB lately it's pretty much always in a state of “graceful feature degradation”.
For example, as soon as I log in I see a bell icon in the upper right with a bright red circle containing an exact positive integer number of notifications. It practically screams “click here, you have urgent business”.
I can then leave the web page sitting there for any number of minutes, and no matter how long I wait, if I click on that notification icon it will take a good 20 seconds to load the list of new notifications. (This is on gigabit fiber in a major metro area, so not a plumbing issue.)
This is one of the great challenges of system engineering. Any slack you build into the system has a tendency to get used over time. If you don't exert some human discipline, monitoring your slack and treating its consumption as at least a medium-priority problem, your system will rapidly evolve (or devolve, if you prefer) into one that has single points of failure after all.
To give a super simple example, suppose you have a database that can transparently fail over to a backup, but it's so "transparent" that nobody even gets notified. Suppose the team even tests it and it proves to work well. The team will then believe that they are very well protected and tell all their customers and management all about how bulletproof their setup is, but if they don't notice that the primary database corrupted and permanently went down in month six because their systems just handle it so well, they'll actually be operating on a single database after all and just be one hiccup from failure.
One of the jobs of an ethical engineer is to make sure management doesn't just say "it's OK, the site is working, forget about it and work on something else" without some appropriate amount of pushback, which you can ground on the fact that sure, they're saying to ignore it now, but when the second DB goes down and the site goes down they sure won't be defending you with "oh, but I told the engineering team to ignore the alerts and keep delivering features so it's really my fault and not theirs the site went down".
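The fix for the silent-failover trap above is to make the "transparent" failover transparent to users but loud to operators. A rough sketch, where `alert` is a hypothetical pager or ticketing hook:

```python
class FailoverDB:
    """Fails over from primary to backup transparently, but alerts exactly once."""

    def __init__(self, primary, backup, alert):
        self.primary = primary
        self.backup = backup
        self.alert = alert       # hypothetical hook: page someone, file a ticket, etc.
        self.on_backup = False

    def query(self, q):
        if not self.on_backup:
            try:
                return self.primary(q)
            except Exception as e:
                self.on_backup = True
                # Users see nothing; operators learn their slack is now gone.
                self.alert(f"primary failed ({e}); now running on a single backup")
        return self.backup(q)
```

The one-shot alert is deliberate: the point is not per-query noise but a durable signal that the system is now one hiccup from failure.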
At Facebook's scale, something will always be in a state of degradation. It's just a fact of life.
Have you tried navigating the website with a web proxy (Charles, Burp Suite, or a similar tool) to intercept the HTTP requests and replay them yourself multiple times to see if the latency is consistent? It'd be interesting to find out whether the delay is fabricated in the front-end code or whether the back-end is really the problem. I don't use Facebook, but I asked a friend just now and the response time for the notifications panel to appear is between 500ms and 2000ms, which is relatively fast for web interactions.
Without being able to verify, I would assume it's designed to behave this way: the longer you wait, the more anticipation builds, and the more gratifying it becomes.
> it will take a good 20 seconds to load the list of new notifications.
Same thing here. Thought it was my (relatively much) slower Internet connection, or maybe that I had something "wrong" (what exactly that might have been, I don't know).
Because the HN crowd likes learning new things: if `enabled` is a nullable boolean in C# (i.e. has type `bool?`) then this check must indeed be written this way, to avoid confusing null with false.
dang|2 years ago
Defcon: Preventing Overload with Graceful Feature Degradation - https://news.ycombinator.com/item?id=36923049 - July 2023 (1 comment)
gillh|2 years ago
https://github.com/fluxninja/aperture
jedberg|2 years ago
We built a similar tool at Netflix but the degradations could be both manual and automatic.
packetslave|2 years ago
The manual part of Defcon is more "holy crap, we lost a datacenter and the whole site is melting, turn stuff off to bring the load down ASAP"
OtherShrezzing|2 years ago
This could use some light-touch code reviewing