HelloNurse | 26 days ago
"Losing sleep" implies an actual problem, which in turn implies that the mitigations you mention (and similar ones) have not been applied, or not applied properly, for dire reasons that are probably a bigger problem than bad QoS.
"Infrastructure" implies an expectation that you deploy something external to the troubled application: there is a defective, presumably simplistic application architecture, and fixing it is not an option. This puts you in an awkward position: someone else is incompetent or unreasonable, but the responsibility for keeping their dumpster fire running falls on you.
rjpruitt16 | 26 days ago
By “losing sleep” I really mean on-call fatigue during partial outages — the class of incidents where backoff, shedding, and breakers exist, but retry amplification, shared rate limits, or degraded dependencies still cause noisy pages and prolonged recovery.
I’m trying to understand how teams coordinate retries and backpressure across many independent clients/services when refactors aren’t immediately available, not to replace good architecture or take ownership of someone else’s system.
If you’ve seen patterns that consistently avoid that on-call pain at scale, I’d genuinely love to learn from them.
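To make "retry amplification" concrete, here is a rough sketch of the worst-case arithmetic, assuming a simple chain of services where every layer independently retries a failing call. The function name and the numbers are illustrative only and do not describe any particular system.

```python
def worst_case_calls(attempts_per_layer: int, layers: int) -> int:
    """Calls that can reach the deepest dependency for ONE user request.

    Assumes every layer makes `attempts_per_layer` tries in total
    (1 initial try + retries) and that every try fails, so each
    layer multiplies the load coming from the layer above it.
    """
    if attempts_per_layer < 1 or layers < 0:
        raise ValueError("need attempts_per_layer >= 1 and layers >= 0")
    return attempts_per_layer ** layers


if __name__ == "__main__":
    # Hypothetical setup: 3 attempts per layer (1 try + 2 retries).
    for layers in (1, 2, 3, 4):
        print(f"{layers} layer(s): up to {worst_case_calls(3, layers)} calls")
```

Under these assumptions, three layers with three attempts each can turn one request into 27 calls against the struggling dependency, which is why per-client backoff alone may not prevent the pile-up.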