Show HN: Former Cloudflare SRE building OpsCompanion, a live map of what's running

4 points| kennethops | 1 month ago

Hey HN, I’m Kenneth. I spent several years as a Senior SRE at Cloudflare.

One thing that became painfully clear over time is that most outages, security issues, and compliance fire drills don’t come from a lack of tools. They come from missing context. People don’t know what’s running, how things connect, or what changed recently, especially once systems sprawl across clouds, repos, and teams.

That’s why I’m building OpsCompanion.

The goal is simple: keep a live, shared picture of what’s actually running and how it fits together.

OpsCompanion helps engineers:

See a live, visual map of services, infrastructure, and dependencies

Answer “what changed?” without digging through five tools, Slack threads, or outdated docs

Preserve operational context so the next person on call isn’t starting from zero

This isn’t about adding more logs or alerts, or slapping AI on top of existing dashboards. It’s about capturing the mental model experienced operators carry in their heads and keeping it shared and up to date.

It’s still early, and there are rough edges. I’ve opened it up to a small group of engineers who work close to production so I can get honest feedback. If it’s useful, great. If not, I genuinely want to understand why and what would make it better.

You can try it here: https://opscompanion.ai/?utm_source=hn&utm_medium=show_hn&ut...

I’ll be around in the comments. Happy to answer technical questions, hear skepticism, get a bit roasted, or talk about what actually breaks in real systems.

4 comments

incidentiq|1 month ago

The "mental model that experienced operators carry in their heads" framing resonates. The real problem isn't lack of tools - it's that the knowledge is ephemeral. Senior SRE leaves, their context leaves with them. Incident happens at 3am, and the on-call person is essentially doing archaeology.

Two observations from similar tooling attempts I've seen:

1. The hardest part isn't generating the map - it's keeping it accurate. Every tool that promises "live view of what's running" eventually drifts from reality because infrastructure changes faster than discovery runs. The teams that made this work treated the map as the source of truth and pushed changes through it, not around it.

2. Re: your feedback about write access - the "prototype to production-ready AWS" use case is interesting. That's where the value of context is highest (greenfield) and the risk is lowest (nothing to break yet). Much easier trust equation than "let it modify my production K8s cluster."

How are you handling the drift problem? Auto-discovery polling, change events from cloud providers, or something else?

kennethops|1 month ago

>The real problem isn't lack of tools - it's that the knowledge is ephemeral.

This is 100% the problem. This is why we are trying to capture business context and attach it to the infra itself instead of just keeping it in docs.

>How are you handling the drift problem? Auto-discovery polling, change events from cloud providers, or something else?

We built a pretty awesome approach to handling the drift problem. We combine indexing, change event capture, and user behavior. So if a user is looking for a piece of information, we pull the live value first.
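To make the "live value first" idea concrete, here is a minimal sketch in Python. It is not OpsCompanion's actual implementation: the `ContextIndex` class, its method names, and the `fetch_live` callback are all hypothetical, and the only assumption is that a lookup tries the live source first and falls back to the last indexed value when the live read fails.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Tuple


@dataclass
class Entry:
    value: str
    observed_at: float  # Unix timestamp of the last observation


class ContextIndex:
    """Hypothetical store kept fresh by indexing, change events, and lookups."""

    def __init__(self, fetch_live: Callable[[str], Optional[str]]):
        # fetch_live is an assumed callback that queries the real environment.
        self._fetch_live = fetch_live
        self._entries: Dict[str, Entry] = {}

    def record(self, key: str, value: str, now: Optional[float] = None) -> None:
        """Write path shared by periodic indexing and change-event capture."""
        ts = time.time() if now is None else now
        self._entries[key] = Entry(value, ts)

    def lookup(self, key: str, now: Optional[float] = None) -> Tuple[Optional[str], str]:
        """User-driven path: prefer the live value, fall back to the index."""
        try:
            live = self._fetch_live(key)
        except Exception:
            live = None  # Live source unavailable; use what was indexed.
        if live is not None:
            self.record(key, live, now)  # A user lookup also refreshes the index.
            return live, "live"
        entry = self._entries.get(key)
        if entry is None:
            return None, "missing"
        return entry.value, "indexed"


# Usage: the index holds a stale version until a user looks the key up.
environment = {"svc-a": "v2"}
index = ContextIndex(lambda key: environment.get(key))
index.record("svc-a", "v1", now=1.0)  # stale value from an earlier indexing run
index.record("svc-b", "v7", now=1.0)  # no longer readable from the live source
print(index.lookup("svc-a", now=2.0))
print(index.lookup("svc-b", now=2.0))
```

Returning the source label alongside the value lets a caller show whether an answer was read live or came from the index, which matters when the two can drift apart.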

shukantpal|1 month ago

In your pilots so far, what's the feedback you've gotten?

kennethops|1 month ago

So far the feedback has clustered around a few themes:

People want it to be significantly more proactive over time, things like root cause analysis, security-style probing, or guided investigations rather than just visibility.

There’s interest in going deeper on telemetry and using it to surface higher-level insights, not just raw data or links out to other tools.

A lot of people ask whether it can eventually write to environments. The direction that’s resonated most is doing this first for new or greenfield environments. For example, going from a prototype to a production-ready AWS setup in a more agentic way. For existing environments, trust and safety are still the gating factors.

My takeaway is that read-only context earns trust first, and write access has to be very deliberate and staged.