tech_ken | 1 month ago

> Observability made us very good at producing signals, but only slightly better at what comes after: interpreting them, generating insights, and translating those insights into reliability.

I'm a data professional who's kind of SRE-adjacent for a big corpo's infra arm and wow does this post ring true for me. I'm tempted to just say "well duh, producing telemetry was always the low-hanging fruit, it's the 'generating insights' part that's truly hard", but I think that's too pithy. My more reflective take is that generating reliability from data lives in a weird hybrid space of domain knowledge and data management, and most orgs' headcount strategies don't account for this. SWEs pretend that data scientists are just SQL jockeys minutes from being replaced by an LLM agent; data scientists pretend that stats is the only "hard" thing and all domain knowledge can be learned with sufficient motivation and documentation. In reality I think both are equally hard, it's rare to find someone who can do both, and doing both is really what's required for true "observability".

At a high level I'd say there are three big areas where orgs (or at least my org) tend to fall short:

* extremely sound data engineering and org-wide normalization (to support correlating diverse signals with highly disparate sources during root-cause)

* telemetry that's truly capable of capturing the problem (i.e., it's not helpful to monitor disk usage if CPU is the bottleneck)

* true 'sleuths' who understand how to leverage the first two things to produce insights, and have the org-wide clout to get those insights turned into action

I think most orgs tend to pick two of these, and cheap out on the third, and the result is what you describe in your post. Maybe they have some rockstar engineers who understand how to overcome the data ecosystem shortcomings to produce a root-cause analysis, or maybe they pay through the nose for some telemetry/dashboard platform that they then hand over to contract workers who brute-force reliability through tons of work hours. Even when they do create dedicated reliability teams, it seems like they are more often than not hamstrung by not having any leverage with the people who actually build the product. And when everything is a distributed system it might actually be 5 or 6 teams who you have no leverage with, so even if you win over 1 or 2 critical POCs you're left with an incomplete patchwork of telemetry systems which meet the owning team's (teams') needs and nothing else.

All this to say that I think reliability is still ultimately an incentive problem. You can have the best observability tooling in the world, but if you don't have folks at every level of the org who (a) understand what 'reliable' concretely looks like for your product and (b) have the power to effect necessary changes, then you're going to get a lot of churn with little benefit.
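
To make the first bullet a bit more concrete, here's a rough Python sketch of what "org-wide normalization" can mean at the smallest scale: two sources that name the host and timestamp fields differently get mapped onto one shared shape so they can be joined during root-cause. The source names (`metrics_agent`, `app_logs`), the field names, and the `FIELD_MAPS` table are all made up for illustration, not taken from any real platform.

```python
from datetime import datetime, timezone

# Hypothetical per-source mappings: which raw field holds the host,
# which holds the timestamp, and what unit that timestamp is in.
FIELD_MAPS = {
    "metrics_agent": {"host": "hostname", "ts": "timestamp_ms", "ts_unit": "ms"},
    "app_logs": {"host": "node", "ts": "time", "ts_unit": "s"},
}


def normalize(source: str, record: dict) -> dict:
    """Map a raw record onto a shared schema: source, host, time (UTC), payload."""
    mapping = FIELD_MAPS[source]
    raw_ts = record[mapping["ts"]]
    seconds = raw_ts / 1000 if mapping["ts_unit"] == "ms" else raw_ts
    return {
        "source": source,
        # Lower-case hosts so "WEB-01" and "web-01" join cleanly.
        "host": record[mapping["host"]].lower(),
        "time": datetime.fromtimestamp(seconds, tz=timezone.utc),
        # Everything that is not host or timestamp is kept as-is.
        "payload": {
            k: v
            for k, v in record.items()
            if k not in (mapping["host"], mapping["ts"])
        },
    }


metric = normalize(
    "metrics_agent",
    {"hostname": "WEB-01", "timestamp_ms": 1700000000000, "cpu": 0.93},
)
log = normalize(
    "app_logs",
    {"node": "web-01", "time": 1700000000, "msg": "timeout"},
)
# After normalization the two records share the same host and time keys.
same_event = (metric["host"], metric["time"]) == (log["host"], log["time"])
```

The hard part in practice is not this function, it's getting every team to agree on (and maintain) the mapping table.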

shcallaway|1 month ago

This is a super insightful comment & there is a bunch that I want to respond to but I can't do it all neatly in one comment. Hahaha

I'll choose this point:

> reliability is still ultimately an incentive problem

This is a fascinating argument and it feels true.

Think about it. Why do companies give a shit about reliability at all? They only care b/c it impacts the bottom line. If the app is "reliable enough" such that customers aren't complaining and churning, it makes sense that the company would not make further investments in reliability.

This same logic is true at all levels of the organization, but the signal gets weaker as you go down the chain. A department cares about reliability b/c it impacts the bottom line of the org, but that signal (revenue) is not directly attributable to the department. This is even more true for a team, or an individual.

I think SLOs are, to some extent, a mechanism that is designed to mitigate this problem; they serve as stronger incentive signals for departments and teams.
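
As a rough illustration of how an SLO turns "reliability" into a number a single team can own, here's a minimal Python sketch of an error-budget calculation. The `SloWindow` shape, the team name, and the 99.9% target are assumptions picked for the example, not anything from the post.

```python
from dataclasses import dataclass


@dataclass
class SloWindow:
    """Request counts for one team's service over an SLO window (hypothetical shape)."""
    team: str
    total_requests: int
    failed_requests: int


def error_budget_remaining(window: SloWindow, slo_target: float) -> float:
    """Return the fraction of the error budget still unspent.

    1.0 means nothing spent, 0.0 means exactly exhausted,
    and a negative value means the SLO was missed.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if window.total_requests == 0:
        return 1.0  # no traffic, so no budget was spent
    # The budget is the number of failures the target tolerates.
    allowed_failures = (1.0 - slo_target) * window.total_requests
    return 1.0 - window.failed_requests / allowed_failures


# Assumed numbers: 1M requests, 250 failures, 99.9% availability target.
checkout = SloWindow(team="checkout", total_requests=1_000_000, failed_requests=250)
remaining = error_budget_remaining(checkout, slo_target=0.999)  # about 0.75
```

The point is that "75% of budget left" is attributable to one team in a way that revenue isn't.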

donavanm|1 month ago

I'd +1 incentives, primarily P&L/revenue/customer acquisition/retention, with a small carve out for "culture." I've worked places, and for people, where the culture was to "do the right thing" or focus on user experience as the objective, which influenced decisions like paying more (time and money) for better support. For the SDEs and line teams it wasn't about revenue or someone yelling at them, they just emulated the behavior they saw around them, which led to better observability/introspection/reliability/support. Which, of course, we'd like to believe leads to long-term success and $$$$.

I also like the call out of SLOs (or OKR or SMART goals or whatever) as a mechanism to broadcast your priorities and improve visibility. BUT I've also worked places where they didn't work because the ultimate owner with a VP title didn't care or understand enough to buy into it.

And of course there's the hazard of principal-agent problems: those selling, buying, building, and running are probably different teams and may not have any meaningful overlap in directly responsible individuals.

ghaff|1 month ago

It's a long-running topic in a lot of areas. I remember back when data warehousing was the hot thing, collecting and cleaning all this data was supposed to be the key to insights that would unlock juicy profits. Basically didn't happen.

hommes-r|1 month ago

I would add that "extremely sound data engineering" is also necessary to make observability cost-effective. Some of these OTel platforms can burn 10%-25% of your cloud budget to show you your logs. That is insane.