incidentiq | 1 month ago

The "mental model that experienced operators carry in their heads" framing resonates. The real problem isn't lack of tools - it's that the knowledge is ephemeral. Senior SRE leaves, their context leaves with them. Incident happens at 3am, and the on-call person is essentially doing archaeology.

Two observations from similar tooling attempts I've seen:

1. The hardest part isn't generating the map - it's keeping it accurate. Every tool that promises "live view of what's running" eventually drifts from reality because infrastructure changes faster than discovery runs. The teams that made this work treated the map as the source of truth and pushed changes through it, not around it.

2. Re: your feedback about write access - the "prototype to production-ready AWS" use case is interesting. That's where the value of context is highest (greenfield) and the risk is lowest (nothing to break yet). Much easier trust equation than "let it modify my production K8s cluster."

How are you handling the drift problem? Auto-discovery polling, change events from cloud providers, or something else?

kennethops|1 month ago

>The real problem isn't lack of tools - it's that the knowledge is ephemeral.

This is 100% the problem. This is why we are trying to capture business context and attach it to the infra itself vs. just keeping it in docs.

>How are you handling the drift problem? Auto-discovery polling, change events from cloud providers, or something else?

We built a pretty awesome approach to handling the drift problem. We combine indexing, change event capture, and user behavior. So if a user is looking for a piece of information, we pull the live value first.
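
For anyone trying to picture how those three pieces could fit together, here is a rough sketch. It is not the actual implementation, just one way to read the description above. `ResourceIndex`, `fetch_live`, and the dict-shaped resource states are all made-up names for illustration; `fetch_live` stands in for whatever cloud API read you would really make.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional


@dataclass
class ResourceIndex:
    # Hypothetical callable standing in for a live cloud API read.
    fetch_live: Callable[[str], Optional[dict]]
    _index: Dict[str, dict] = field(default_factory=dict)

    def reindex(self, snapshot: Dict[str, dict]) -> None:
        """Periodic indexing: replace the index with a fresh full scan."""
        self._index = dict(snapshot)

    def apply_event(self, resource_id: str, state: Optional[dict]) -> None:
        """Change event: update one entry, or drop it when state is None."""
        if state is None:
            self._index.pop(resource_id, None)
        else:
            self._index[resource_id] = state

    def lookup(self, resource_id: str) -> Optional[dict]:
        """User lookup: try the live value first, fall back to the index."""
        try:
            live = self.fetch_live(resource_id)
        except Exception:
            # Live read failed, so serve the possibly stale indexed value.
            return self._index.get(resource_id)
        # Repair the index with what was just observed.
        self.apply_event(resource_id, live)
        return live


# Usage: the index is stale, but the lookup returns and stores the live value.
cloud = {"db-1": {"size": "large"}}
idx = ResourceIndex(fetch_live=cloud.get)
idx.reindex({"db-1": {"size": "small"}})
current = idx.lookup("db-1")
```

The point of the read-through in `lookup` is that the entries people actually look at get corrected first, so drift matters least exactly where attention is highest. The polling interval and event source are left out on purpose, since the comment above does not say how those are handled.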