item 46741660


incidentiq | 1 month ago

Been on-call across several orgs. To answer your questions:

1. "AI SRE" useful or hype? Useful in specific contexts, but the trust barrier is real. Most on-call engineers are skeptical of AI suggestions during incidents because the cost of a wrong recommendation at 3am is high. That said, the pain of digging through logs and finding relevant context is also real.

2. Where it helps: The biggest wins are in "pre-work" - surfacing relevant past incidents before you start investigating, correlating alerts that are likely related, and summarizing what changed recently. That reduces the "context gathering" phase, which often eats 30%+ of incident time.
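The alert-correlation part of that pre-work can start out very simple: group alerts that fired close together in time and share labels. A minimal sketch (the `Alert` shape, the 5-minute window, and the label-overlap rule are my assumptions, not anyone's product):

```python
from dataclasses import dataclass

@dataclass
class Alert:
    # Hypothetical alert shape; real alerts carry far more metadata.
    name: str
    ts: float      # epoch seconds
    labels: dict

def correlate(alerts, window=300, min_shared=1):
    """Group alerts fired within `window` seconds that share labels.

    Naive single-pass sketch: walk alerts in time order and append to
    an existing group when the gap is small and labels overlap.
    """
    groups = []
    for a in sorted(alerts, key=lambda a: a.ts):
        placed = False
        for g in groups:
            last = g[-1]
            shared = set(a.labels.items()) & set(last.labels.items())
            if a.ts - last.ts <= window and len(shared) >= min_shared:
                g.append(a)
                placed = True
                break
        if not placed:
            groups.append([a])
    return groups

alerts = [
    Alert("HighLatency", 100, {"service": "api"}),
    Alert("ErrorRate",   160, {"service": "api"}),
    Alert("DiskFull",   5000, {"service": "db"}),
]
groups = correlate(alerts)
print([len(g) for g in groups])  # → [2, 1]
```

Real correlation would use topology and causal signals, but even this level of grouping cuts alert noise at triage time.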

3. Trust requirements: For me to trust it:

- Show confidence levels and reasoning. "Here's what I found and why" beats "do this."
- Be a copilot that accelerates my investigation, not an agent that acts on my behalf.
- Get the easy stuff 100% right before attempting the hard stuff. If log correlation is wrong on obvious patterns, I won't trust root-cause suggestions.

The RAPTOR approach for runbooks is interesting - the "loss of context in chunked RAG" problem is real for long-form incident docs. How do you handle cases where relevant context spans multiple documents (runbook references an architecture doc)?
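One pattern I've seen for the cross-document case (purely a sketch of the idea, not a claim about how any product implements it): at indexing time, detect references to other docs inside each chunk and attach a short summary of the referenced doc, so a chunk retrieved on its own still carries the context it points at. The corpus, link syntax, and `summarize` stand-in below are all assumptions.

```python
import re

# Toy corpus; real runbooks and arch docs would come from a wiki or repo.
docs = {
    "runbook-api.md": "If latency spikes, check the cache tier. "
                      "See [arch-overview.md] for topology.",
    "arch-overview.md": "The api service sits behind a cache tier "
                        "backed by redis.",
}

def summarize(text, n=80):
    # Stand-in for a real summarizer (LLM call, first paragraph, ...).
    return text[:n]

REF = re.compile(r"\[([\w.-]+\.md)\]")  # assumed wiki-style link syntax

def chunk_with_linked_context(text, corpus):
    """Append a summary of each referenced doc, so the chunk is
    self-contained when retrieved in isolation."""
    extras = [
        f"(context from {ref}: {summarize(corpus[ref])})"
        for ref in REF.findall(text)
        if ref in corpus
    ]
    return text + ("\n" + "\n".join(extras) if extras else "")

enriched = chunk_with_linked_context(docs["runbook-api.md"], docs)
print(enriched)
```

The trade-off is index-time cost and staleness: the attached summaries have to be rebuilt when the referenced doc changes.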
