A worrying choice of words.

"Losing sleep" implies an actual problem, which in turn implies that the mentioned mitigations and similar ones have not been applied (at least not properly), presumably for dire reasons that are likely a bigger problem than bad QoS.

"Infrastructure" implies an expectation that you deploy something external to the troubled application: there is a defective, presumably simplistic application architecture, and fixing it is not an option. This puts you in an awkward position: someone else is incompetent or unreasonable, but the responsibility for keeping their dumpster fire running falls on you.

rjpruitt16|26 days ago

Fair pushback — to clarify, I’m not assuming incompetence or suggesting infra should paper over bad architecture.

By “losing sleep” I really mean on-call fatigue during partial outages — the class of incidents where backoff, shedding, and breakers exist, but retry amplification, shared rate limits, or degraded dependencies still cause noisy pages and prolonged recovery.
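To make "retry amplification" concrete, here is a rough sketch, not a description of any particular system. It assumes a dependency that fails every call and a client that retries up to `max_retries` times, and it compares that against a client-side retry budget (a token bucket that only allows retries in proportion to first attempts). The names `RetryBudget` and `simulate`, and the values `ratio=0.1` and `max_tokens=10.0`, are hypothetical and chosen only for illustration.

```python
class RetryBudget:
    """Hypothetical token-bucket retry budget (illustrative only)."""

    def __init__(self, ratio=0.1, max_tokens=10.0):
        self.ratio = ratio            # retry tokens earned per first attempt
        self.max_tokens = max_tokens  # cap on banked retry tokens
        self.tokens = max_tokens

    def record_request(self):
        # Each first attempt earns a fraction of a retry token, up to the cap.
        self.tokens = min(self.max_tokens, self.tokens + self.ratio)

    def try_spend(self):
        # A retry is allowed only if a whole token is available.
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


def simulate(n_requests, max_retries, budget=None):
    """Count attempts sent to a dependency that fails every call."""
    attempts = 0
    for _ in range(n_requests):
        if budget is not None:
            budget.record_request()
        attempts += 1  # first attempt, which fails by assumption
        for _ in range(max_retries):
            if budget is not None and not budget.try_spend():
                break  # budget exhausted: give up instead of retrying
            attempts += 1  # the retry also fails
    return attempts


if __name__ == "__main__":
    print("no budget:  ", simulate(1000, 3))
    print("with budget:", simulate(1000, 3, RetryBudget()))
```

Under these assumptions, the unbudgeted client sends 4x the original load (1000 requests become 4000 attempts), while the budgeted client is bounded at roughly 1.1x plus the initial burst allowance. This only covers a single client process; it does not address coordination across many independent clients.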

I’m trying to understand how teams coordinate retries and backpressure across many independent clients/services when refactors aren’t immediately available, not replace good architecture or take ownership of someone else’s system.

If you’ve seen patterns that consistently avoid that on-call pain at scale, I’d genuinely love to learn from them.