top | item 39554874

Defcon: Preventing overload with graceful feature degradation (2023)

237 points | mlerner | 2 years ago | micahlerner.com

95 comments


winrid|2 years ago

One of the most satisfying feature degradation steps I did with FastComments was making it so that if the DB went offline completely, the app would still function:

1. It auto restarts all workers in the cluster in "maintenance mode".

2. A "maintenance mode" message shows on the homepage.

3. The top 100 pages by comment volume will still render their comment threads, as a job on each edge node recalculates and stores this on disk periodically.

4. Logging in is disabled.

5. All db calls to the driver are stubbed out with mocks to prevent crashes.

6. Comments can still be posted and are added into an on-disk queue on each edge node.

7. When the system is back online the queue is processed (and stuff checked for spam etc like normal).

It's not perfect but it means in a lot of cases I can completely turn off the DB for a few minutes without panic. I haven't had to use it in over a year, though, and the DB doesn't really go down. But useful for upgrades.

Built it on my couch during a Jurassic Park marathon :P
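The queue-and-replay of steps 6 and 7 could be sketched roughly like this. All names and the one-JSON-object-per-line format are invented for illustration, not actual FastComments code:

```python
import json
import os

class OfflineCommentQueue:
    """Sketch of steps 6 and 7: append comments to an on-disk queue
    while the DB is down, then replay them through the normal pipeline
    once it's back."""

    def __init__(self, path: str):
        self.path = path

    def enqueue(self, comment: dict) -> None:
        # One JSON object per line; appended entries survive a crashed worker.
        with open(self.path, "a") as f:
            f.write(json.dumps(comment) + "\n")

    def drain(self, save_to_db) -> int:
        # Called once the DB is reachable again; each entry goes through
        # `save_to_db`, which would do the usual spam checks etc.
        if not os.path.exists(self.path):
            return 0
        with open(self.path) as f:
            pending = [json.loads(line) for line in f if line.strip()]
        for comment in pending:
            save_to_db(comment)
        os.remove(self.path)
        return len(pending)
```

The append-only file keeps the failure path simple: there is nothing to coordinate per edge node, and draining is idempotent-ish because the file is removed only after every entry has been handed off.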

tsss|2 years ago

Sounds like you're just failing over to a custom database.

danpalmer|2 years ago

Joining Google a few years ago, one thing I was impressed with is the amount of effort that goes into graceful degradation. For user facing services it gets quite granular, and is deeply integrated into the stack – from application layer to networking.

Previously I worked on a big web app at a growing startup, and it's probably the sort of thing I'd start adding in small ways from the early days. Being able to turn off unnecessary writes, turn down the rate of more expensive computation, turn down rates of traffic amplification, these would all have been useful levers in some of our outages.
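The kinds of levers described here could be sketched as sampled feature flags. Everything below (the lever names, the pageview example) is invented for illustration, not code from either company:

```python
import random

# Hypothetical degradation levers: each is a dial an operator (or
# automation) can turn down during an incident.
LEVERS = {
    "nonessential_writes": 1.0,  # fraction of analytics-style writes kept
    "expensive_ranking": 1.0,    # fraction of requests given full ranking
}

def lever_allows(name: str) -> bool:
    """Probabilistically admit work according to the lever's current rate."""
    return random.random() < LEVERS[name]

def record_pageview(page: str, analytics_log: list) -> None:
    # An "unnecessary write": under overload, turn the lever toward 0.0
    # and some or all of these writes are simply dropped.
    if lever_allows("nonessential_writes"):
        analytics_log.append(page)
```

The value of building this early is that every call site wrapped in a lever is one more thing you can shed during an outage without a code change.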

tuyguntn|2 years ago

It's really great to have such capabilities, but adding them has a cost that only a few can afford: the investment in building them cuts into your feature velocity, and then there's the ongoing maintenance.

Banditoz|2 years ago

Am I reading the second figure right? Facebook can do 130*10^6 queries/second == 130,000,000 queries/second?!

ndriscoll|2 years ago

Facebook makes over 300 requests for me just loading the main logged in page while showing me exactly 1 timeline item. Hovering my mouse over that item makes another 100 requests or so. Scrolling down loads another item at the cost of over 100 requests again. It's impressive in a perverse way just how inefficient they can be while managing to make it still work, and somewhat disturbing that their ads bring in enough money to make them extremely profitable despite it.

bagels|2 years ago

I can't comment on the numbers, but think of how many engineers work there and how many users Facebook, WhatsApp, and Instagram have. Each engineer is adding new features and queries every day. You're going to get a lot of queries.

storyinmemo|2 years ago

I think about 10 years ago when I was working there I checked the trace to load my own homepage. Just one page, just for myself, and there were 100,000 data fetches.

scottlamb|2 years ago

> Am I reading the second figure right? Facebook can do 130*10^6 queries/second == 130,000,000 queries/second?!

That sounds totally plausible to me.

Also keep in mind they didn't say what system this is. It's often true that 1 request to a frontend system becomes 1 each to 10 different backend services owned by different teams and then 20+ total to some database/storage layer many of them depend on. The qps at the bottom of the stack is in general a lot higher than the qps at the top, though with caching and static file requests and such this isn't a universal truth.
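As a back-of-envelope illustration of that fan-out, with invented numbers (not Facebook's real traffic):

```python
# Illustrative numbers only: a modest frontend rate multiplied by
# per-request fan-out quickly reaches nine-figure storage-layer QPS.
frontend_qps = 5_000_000        # hypothetical user-facing requests/sec
storage_calls_per_request = 26  # hypothetical total storage-layer calls

storage_qps = frontend_qps * storage_calls_per_request
print(storage_qps)  # prints 130000000
```

So a headline figure of 130M qps at some bottom-of-stack system does not imply anything like 130M user requests per second.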

gaogao|2 years ago

Those queries are probably mostly memcache hits, though of course with distributed cache invalidation and consistency fun.

sebzim4500|2 years ago

Sounds plausible. There are probably many queries required to display a page and Facebook has 2 billion daily active users.

avery17|2 years ago

What's with people lately writing 10^6 instead of 1 million? It's not so big that we need exponents to get involved.

bdd|2 years ago

Yes. And that was 4 years ago. I must add that the figure does NOT include the static asset serving path.

reissbaker|2 years ago

A custom JIT + language + web framework + DB + queues + orchestrator + hardware built to your precise specifications + DCs all over the world go a long way ;)

Thaxll|2 years ago

We're close to 1 million servers, not 12 racks in a DC.

AlienRobot|2 years ago

iirc Facebook has 3 billion users, so that sounds plausible.

sonicanatidae|2 years ago

Yeah, they allocated ALL of the RAM to their DB servers. lol

mrb|2 years ago

Off-topic, but: I love the font on the website. At first I thought it was the classic Computer Modern font (used in LaTeX). But nope. Upon inspection of the stylesheet, it's https://edwardtufte.github.io/et-book/ which is a font designed by Dmitry Krasny, Bonnie Scranton, and Edward Tufte. The font was originally designed for his book Beautiful Evidence. But people showed interest in the font; see the bulletin board on ET's website: https://www.edwardtufte.com/bboard/q-and-a-fetch-msg?msg_id=... Initially he was reluctant to go to the trouble of releasing it digitally, but eventually he did make it available on GitHub.

jedberg|2 years ago

I'm surprised they don't have automated degradation (or at least the article implies that it must be operator-initiated).

We built a similar tool at Netflix but the degradations could be both manual and automatic.

packetslave|2 years ago

There's definitely automated degradation at smaller scale ("if $random_feature's backend times out, don't show it", etc.).

The manual part of Defcon is more "holy crap, we lost a datacenter and the whole site is melting, turn stuff off to bring the load down ASAP"
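The small-scale automated kind could be sketched as a per-feature deadline with a fallback. All names and timeouts here are illustrative:

```python
import concurrent.futures

# Shared worker pool for feature backends (illustrative sizing).
_pool = concurrent.futures.ThreadPoolExecutor(max_workers=4)

def render_feature(fetch, timeout_s=0.2, fallback=""):
    """Return the feature's content, or `fallback` if its backend is
    slow or erroring -- the "don't show it" path described above."""
    future = _pool.submit(fetch)
    try:
        return future.result(timeout=timeout_s)
    except Exception:
        # Timed out or the backend raised: render the page without it.
        return fallback
```

The page then composes whatever features came back in time, so one sick backend degrades one widget instead of the whole response.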

mikerg87|2 years ago

Isn't this referred to as load shedding in some circles? If it's not, can someone explain how it's different?

scottlamb|2 years ago

They're the same thing or close to it. "Load shedding" might be a bit more general. A couple possible nuances:

* Perhaps "graceful feature degradation" as a choice of words is a way of noting there's immediate user impact (but less than ungracefully running out of capacity). "Load shedding" could also mean something less impactful, for example some cron job that updates some internal dashboard skipping a run.

* "feature degradation" might focus on how this works at the granularity of features, where load shedding might mean something like dropping request hedges / retries, or individual servers saying they're overloaded and the request should go elsewhere.

kqr|2 years ago

This is the other side of the load shedding coin.

The situation is that A depends on B, but B is overloaded; if we allow B to do load shedding, we must also write A to gracefully degrade when B is not available.
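A toy of both sides of that coin, with every name invented for illustration: B sheds load once it's at capacity, and A is written to degrade gracefully when B refuses.

```python
class Overloaded(Exception):
    """Raised by B to shed load instead of queueing unboundedly."""

class ServiceB:
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.in_flight = 0

    def handle(self, request: str) -> str:
        if self.in_flight >= self.capacity:
            raise Overloaded()  # load shedding: fail fast, stay healthy
        self.in_flight += 1
        try:
            return "ranked:" + request
        finally:
            self.in_flight -= 1

def serve_from_a(request: str, b: ServiceB) -> str:
    try:
        return b.handle(request)
    except Overloaded:
        # Graceful degradation in A: a worse answer beats no answer.
        return "unranked:" + request
```

Neither half works alone: if A isn't written for the `Overloaded` case, B's shedding just turns into user-visible errors.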

velcrovan|2 years ago

Seems like whenever I log into FB lately it's pretty much always in a state of “graceful feature degradation”.

For example, as soon as I log in I see a bell icon in the upper right with a bright red circle containing an exact positive integer number of notifications. It practically screams “click here, you have urgent business”.

I can then leave the web page sitting there for any number of minutes, and no matter how long I wait, if I click on that notification icon it will take a good 20 seconds to load the list of new notifications. (This is on gigabit fiber in a major metro area, so not a plumbing issue.)

jerf|2 years ago

This is one of the great challenges of system engineering. Any slack you build into the system has a tendency to get used over time. That means that if you don't exert some human discipline (monitoring your slack, and treating its depletion as at least a medium-priority issue), your system will rapidly evolve (or devolve, if you prefer) into one that has single points of failure after all.

To give a super simple example, suppose you have a database that can transparently fail over to a backup, but it's so "transparent" that nobody even gets notified. Suppose the team even tests it and it proves to work well. The team will then believe they are very well protected and tell all their customers and management how bulletproof their setup is. But if they don't notice that the primary database was corrupted and permanently went down in month six, because their systems just handle it so well, they'll actually be operating on a single database after all, just one hiccup from failure.

One of the jobs of an ethical engineer is to make sure management doesn't just say "it's OK, the site is working, forget about it and work on something else" without some appropriate amount of pushback. You can ground that pushback in the fact that sure, they're saying to ignore it now, but when the second DB goes down and takes the site with it, they sure won't be defending you with "oh, but I told the engineering team to ignore the alerts and keep delivering features, so it's really my fault and not theirs that the site went down".

At Facebook's scale, something will always be in a state of degradation. It's just a fact of life.
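The failover example above could be sketched as a wrapper that is transparent to callers but loud to operators. `primary` and `backup` are callables standing in for real DB clients; all names are invented:

```python
class FailoverDB:
    """Transparent failover that is deliberately NOT silent: every
    failover fires an alert, so nobody runs on the last copy unknowingly."""

    def __init__(self, primary, backup, alert):
        self.dbs = [primary, backup]
        self.alert = alert

    def query(self, q):
        while self.dbs:
            try:
                return self.dbs[0](q)
            except Exception as e:
                self.dbs.pop(0)  # demote the failed copy
                self.alert(f"DB failed over ({e}); {len(self.dbs)} copies left")
        raise RuntimeError("no databases left")
```

Callers never see the primary die, but the `alert` callback is the "medium priority" signal that the slack has been spent.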

guessmyname|2 years ago

Have you tried navigating the website with a web proxy (Charles, Burp Suite, or a similar tool) to intercept the HTTP request(s) and replay them yourself multiple times, to see if the latency is consistent? It'd be interesting to discover whether the delay is fabricated in the front-end code or whether the back-end server is really the problem. I don't use Facebook, but I asked a friend just now and the response time for the notifications panel to appear is between 500ms and 2000ms, which is relatively fast for web interactions.

philippta|2 years ago

Without being able to verify, I would assume it’s designed to behave in this way. The longer you wait the more anticipation builds up, the more gratifying it becomes.

paganel|2 years ago

> it will take a good 20 seconds to load the list of new notifications.

Same thing here. Thought it was my (relatively much) slower Internet connection, or maybe that I had something "wrong" (what exactly that might have been, I don't know).

Arainach|2 years ago

The initial render of Facebook's UI slows dramatically (I suspect but cannot prove intentionally) if you have adblockers/uBlock Origin/etc.

iraqmtpizza|2 years ago

At least once, YouTube slowed to a crawl until I cleared the cookie.

OtherShrezzing|2 years ago

> if (disableCommentsRanking.enabled == False)

This could use some light-touch code reviewing

kqr|2 years ago

Because the HN crowd likes learning new things: if `enabled` is a nullable boolean in C# (i.e. has type `bool?`) then this check must indeed be written this way, to avoid confusing null with false.
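The same footgun exists outside C#. Here's the Python analog with `Optional[bool]`, where `None` (unset) must not be conflated with `False`; the function name is invented:

```python
from typing import Optional

def ranking_disabled(flag: Optional[bool]) -> bool:
    # With a three-valued flag (True / False / None meaning "unset"),
    # `flag == False` and `not flag` are NOT equivalent: `not None` is
    # True, so a truthiness check would treat an unset flag as disabled.
    return flag == False  # deliberately explicit, not `not flag`
```

So the seemingly redundant `== False` can be the only correct spelling when the flag is nullable.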

dvhh|2 years ago

Some could argue it's there for illustration purposes, and not actual production code.

bmacho|2 years ago

It looks funny, but I think it's actually good, and arguably the best possible form of it.