top | item 41204593

archenemybuntu | 1 year ago

I'm gonna strike a nerve and say most orgs overengineer observability. There's the whole topology of OTel tools, Prometheus tools, and a bunch of long-term storage / querying solutions. Very complicated tracing setups. All of these are fine if you have a team dedicated to maintaining observability. But your avg product development org can sacrifice most of it and get by with proper logging with a request context, plus some important service-level metrics + Grafana + alarms.

The problem with all the above tools is that they all seem like essential features to have, but once you have the whole topology of 50 half-baked CNCF containers set up in "production", shit starts to break in very mysterious ways, and these observability products also tend to cost a lot.
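For reference, the "proper logging with a request context" baseline can be had from the standard library alone. A minimal sketch, assuming a contextvar carries the request id and a logging filter stamps it on every record (the names `request_id` and `handle` are illustrative, not from any particular framework):

```python
# Minimal structured logging with request context, stdlib only:
# a contextvar holds the current request id, and a logging.Filter
# copies it onto every LogRecord so the formatter can emit it.
import contextvars
import logging
import uuid

request_id = contextvars.ContextVar("request_id", default="-")

class RequestContextFilter(logging.Filter):
    def filter(self, record: logging.LogRecord) -> bool:
        # Attach the current request id to each record.
        record.request_id = request_id.get()
        return True

def make_logger() -> logging.Logger:
    logger = logging.getLogger("app")
    handler = logging.StreamHandler()
    # JSON-ish line format so logs are machine-parseable.
    handler.setFormatter(logging.Formatter(
        '{"level": "%(levelname)s", "request_id": "%(request_id)s", '
        '"msg": "%(message)s"}'
    ))
    handler.addFilter(RequestContextFilter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger

def handle(logger: logging.Logger) -> str:
    # Each request gets its own id; every log line within it is tagged.
    rid = uuid.uuid4().hex[:8]
    request_id.set(rid)
    logger.info("request started")
    logger.info("request finished")
    return rid
```

Because contextvars follow async tasks and threads, this keeps working under asyncio or a threaded WSGI server without passing the id around explicitly.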

datadrivenangel|1 year ago

The ratio of 'metadata' to data is often hundreds or thousands to one, which translates to cost, especially if you're using a licensed service. I've been at companies where the analytics and observability costs are 20x the actual cloud hosting cost of the application. Datadog seems to have switched to revenue extraction in a way that would make Oracle proud.

lincolnq|1 year ago

Is that 20x cost... actually bad though? (I mean, I know Datadog is bad. I used to use it and I hated its cost structure.)

But maybe it's worth it. Or at least, the good ones would be worth it. I can imagine great metadata (and platforms to query and explore it) saving more engineering time than it costs in server time. So to me this ratio isn't that material, even though it looks a little weird.

fishtoaster|1 year ago

For what it's worth, I found it almost trivial to set up OpenTelemetry and point it at Honeycomb. It took me an afternoon about a month ago for a medium-sized Python web app. I've found that it replaces a lot of the tooling and manual work needed in the past. At previous startups the progression was usually:

1. Set up basic logging (now I just use otel events)

2. Make it structured logging (Get that for free with otel events)

3. Add a request context that's sent along with each log (also free with otel)

4. Manually set up tracing ids in my codebase and configure it in my tooling (all free with otel spans)

Really, I was expecting to have to buy deep into the new observability philosophy to get value out of it, but I found myself really loving this setup with minimal work and minimal kool-aid-drinking. I'll probably do something like this over "logs, request context, metrics, and alarms" at future startups.
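The "free with otel" part of steps 3 and 4 comes from spans living in an ambient context that child spans and events inherit. A toy sketch of that model in plain Python — illustrative of the idea only, not the real opentelemetry SDK API:

```python
# Toy model of the span/event idea: a contextvar tracks the current
# span, so nested spans join its trace and events automatically carry
# trace/span ids -- the "request context for free" property.
import contextvars
import time
import uuid
from dataclasses import dataclass, field
from typing import Optional

_current_span: contextvars.ContextVar = contextvars.ContextVar(
    "current_span", default=None)

@dataclass
class Span:
    name: str
    trace_id: str
    span_id: str = field(default_factory=lambda: uuid.uuid4().hex[:16])
    parent_id: Optional[str] = None
    events: list = field(default_factory=list)

    def add_event(self, msg: str) -> None:
        # Events inherit the span's ids, so they're queryable by trace.
        self.events.append({"ts": time.time(), "msg": msg,
                            "trace_id": self.trace_id,
                            "span_id": self.span_id})

class start_span:
    """Context manager: makes a span current; children join its trace."""
    def __init__(self, name: str) -> None:
        self.name = name

    def __enter__(self) -> Span:
        parent = _current_span.get()
        self.span = Span(
            name=self.name,
            trace_id=parent.trace_id if parent else uuid.uuid4().hex,
            parent_id=parent.span_id if parent else None,
        )
        self._token = _current_span.set(self.span)
        return self.span

    def __exit__(self, *exc) -> None:
        _current_span.reset(self._token)
```

The real SDK adds exporters, sampling, and auto-instrumentation on top, but the propagation mechanism is the same shape: ambient context, not hand-threaded ids.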

JoshTriplett|1 year ago

I've currently done this, and I'm seriously considering undoing it in favor of some other logging solution. My biggest reason: OpenTelemetry fundamentally doesn't handle events that aren't part of a span, and doesn't handle spans that don't close. So, if you crash, you don't get telemetry to help you debug the crash.

I wish "span start" and "span end" were just independent events, and OTel tools handled and presented unfinished spans or events that don't appear within a span.
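A sketch of that independent-events design: if "span start" is flushed immediately as its own event, a crash before "span end" still leaves telemetry behind for the backend to present as an unfinished span. The `emit` exporter below is a hypothetical stand-in, not any real OTel API:

```python
# Span start/end as independent, immediately-flushed events: a crash
# between start and end still leaves the start event (and anything
# logged inside) available for debugging.
import time
import uuid

emitted = []  # stand-in for an exporter that flushes every event at once

def emit(event: dict) -> None:
    emitted.append(event)

def span_start(name: str) -> str:
    span_id = uuid.uuid4().hex[:16]
    emit({"type": "span_start", "span_id": span_id,
          "name": name, "ts": time.time()})
    return span_id

def span_end(span_id: str) -> None:
    emit({"type": "span_end", "span_id": span_id, "ts": time.time()})

def crashing_handler() -> None:
    sid = span_start("handle_request")
    raise RuntimeError("boom")   # crash before the span closes...
    span_end(sid)                # ...so this never runs
```

With a buffer-until-span-close design, that start event (and everything inside the span) is lost with the process.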

nicbou|1 year ago

How would you underengineer it? What would be a barebones setup for observability at the scale of one person with a few servers running at most a dozen different scripts?

I would like to make sure that a few recurrent jobs run fine, but by pushing a status instead of polling it.

jauntywundrkind|1 year ago

I just find logs to be infuriatingly inconsistent & poorly done. What gets logged is arbitrary as hell & has such poor odds of showing what happened.

Whereas tracing instrumentation is fantastically good at showing where response time is being spent, showing what's getting hit. And it comes with no developer cost; the automatic instrumentation runs & does it all.

Ideally you also throw some additional tags onto the root entry span or current span. That takes some effort. But then it should be consistently & widely available & visible.

Tracing is very hard to get orgs culturally on board with. And there are some operational challenges, but personally I think you are way over-selling how hard it is... There's just a colossal collection of software that serves as really good export destinations, and your team can probably already operate one or two of them quite well.

It does get a lot more complex if you want longer-term storage. Personally I'm pro mixing systems observability with product performance tracking, so yes, you do need to keep some data indefinitely. And that can be hugely problematic: either you're juggling storage and querying for infinitely growing data, or you're building systems to aggregate & persist the derived metrics you need while getting rid of the base data.
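That aggregate-then-discard approach might look like rolling raw durations up into per-day summaries (count, p95) that are cheap to keep forever, then dropping the raw events. A sketch with illustrative field names:

```python
# Roll raw per-request durations up into per-(day, endpoint) summaries
# so only the small derived metrics need indefinite retention.
import math
from collections import defaultdict

def rollup(events: list) -> dict:
    """events: [{"day": "2024-08-01", "endpoint": "/x", "ms": 12.0}, ...]"""
    buckets = defaultdict(list)
    for e in events:
        buckets[(e["day"], e["endpoint"])].append(e["ms"])
    summary = {}
    for key, durations in buckets.items():
        durations.sort()
        # Nearest-rank p95: smallest value covering 95% of the samples.
        idx = max(0, math.ceil(0.95 * len(durations)) - 1)
        summary[key] = {"count": len(durations), "p95_ms": durations[idx]}
    return summary
```

The trade-off the comment describes is exactly this: any question you didn't bake into the rollup (a different percentile, a new grouping key) is unanswerable once the base data is gone.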

But I just can't emphasize enough how bad most orgs are at logs, and how not worth anyone's time it is to invest in something manual like that when it offers so much less than the alternative (traces).

spimmy|1 year ago

HUGE +1 to mixing systems observability with product data. this is an oft-missed aspect of observability 2.0 that is increasingly critical. all of the interesting questions in software are some combination and conjunction of systems, app, and business data.

also big agree that most places are so, so, so messy and bad about doing logs. :( for years, i refused to even use the term "logs", because all the assumptions i wanted people to make were the opposite of the assumptions people bring to logs: unstructured, messy, spray and pray, etc.