Fair pushback — to clarify, I’m not assuming incompetence or suggesting infra should paper over bad architecture.

By “losing sleep” I really mean on-call fatigue during partial outages: the class of incidents where backoff, load shedding, and circuit breakers are all in place, but retry amplification, shared rate limits, or degraded dependencies still cause noisy pages and prolonged recovery.
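
To make "retry amplification" concrete, here is a minimal sketch. It assumes a simple call chain where every layer retries independently, and it pairs that with a per-client retry budget, one commonly described mitigation. The names `worst_case_attempts` and `RetryBudget`, and the 10% default, are hypothetical and chosen only for illustration; this is not taken from any particular library.

```python
def worst_case_attempts(attempts_per_layer: int, layers: int) -> int:
    """Worst-case calls reaching the lowest dependency for one user request,
    assuming each layer independently makes up to `attempts_per_layer` attempts."""
    return attempts_per_layer ** layers


class RetryBudget:
    """Hypothetical per-client budget: retries are allowed only as a
    percentage of the first attempts this client has recorded.

    Integer units are used (100 units = one retry) to avoid float drift.
    """

    COST_PER_RETRY = 100

    def __init__(self, retry_percent: int = 10, max_banked_retries: int = 10):
        self.retry_percent = retry_percent
        self.cap = max_banked_retries * self.COST_PER_RETRY
        self.units = 0

    def record_request(self) -> None:
        # Each first attempt deposits `retry_percent` units, up to the cap.
        self.units = min(self.cap, self.units + self.retry_percent)

    def try_retry(self) -> bool:
        # A retry is permitted only if a full retry's worth of units is banked.
        if self.units >= self.COST_PER_RETRY:
            self.units -= self.COST_PER_RETRY
            return True
        return False


if __name__ == "__main__":
    print(worst_case_attempts(3, 3))
    budget = RetryBudget(retry_percent=10)
    for _ in range(10):
        budget.record_request()
    print(budget.try_retry(), budget.try_retry())
```

Under these assumptions, three layers that each try up to three times can turn one user request into as many as 27 calls on the degraded dependency, while the budget caps a client's retries at roughly 10% of its first attempts no matter how many calls are failing. A budget like this is local to each client, so it bounds amplification but does not by itself coordinate clients that share a rate limit.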

I’m trying to understand how teams coordinate retries and backpressure across many independent clients and services when refactors aren’t immediately available. I’m not trying to replace good architecture or take ownership of someone else’s system.

If you’ve seen patterns that consistently avoid that on-call pain at scale, I’d genuinely love to learn from them.
