I'm going to strike a nerve and say most orgs overengineer observability. There's the whole topology of OTel tools, Prometheus tools, and a bunch of long-term storage / querying solutions, plus very complicated tracing setups. All of these are fine if you have a team dedicated to maintaining observability, but your average product development org can sacrifice most of it and get by with proper logging with a request context, plus some important service-level metrics + Grafana + alarms. The problem with all these tools is that they all seem like essential features to have, but once you have the whole topology of 50 half-baked CNCF containers set up in "production", shit starts to break in very mysterious ways, and these observability products also tend to cost a lot.
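The "proper logging with a request context" approach advocated above needs nothing beyond the standard library. A minimal sketch (the JSON field names and the `request_id` contextvar are illustrative assumptions, not a standard):

```python
import json
import logging
import uuid
from contextvars import ContextVar

# Holds the current request's id; set once per request, read by every log call.
request_id: ContextVar[str] = ContextVar("request_id", default="-")

class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object with the request context attached."""
    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            "request_id": request_id.get(),
        })

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
log = logging.getLogger("app")
log.addHandler(handler)
log.setLevel(logging.INFO)

def handle_request(payload: dict) -> None:
    # One id per request; every log line emitted inside this call carries it.
    request_id.set(uuid.uuid4().hex)
    log.info("request received")
    log.info("request handled")

handle_request({"user": "alice"})
```

Because the id lives in a `ContextVar`, it stays correct across async tasks without threading it through every function signature.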
datadrivenangel|1 year ago
lincolnq|1 year ago
But maybe it's worth it. or at least, the good ones would be worth it. I can imagine great metadata (and platforms to query and explore it) saves more engineering time than it costs in server time. So to me this ratio isn't that material, even though it looks a little weird.
fishtoaster|1 year ago
1. Set up basic logging (now I just use otel events)
2. Make it structured logging (Get that for free with otel events)
3. Add request contexts that are sent along with each log (Also free with otel)
4. Manually set up tracing ids in my codebase and configure it in my tooling (all free with otel spans)
Honestly, I was expecting to have to buy deep into the new observability philosophy to get value out of it, but I found myself really loving this setup with minimal work and minimal Kool-Aid drinking. I'll probably do something like this over "logs, request context, metrics, and alarms" at future startups.
JoshTriplett|1 year ago
I wish "span start" and "span end" were just independent events, and OTel tools handled and presented unfinished spans or events that don't appear within a span.
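A sketch of what that could look like: emit "start" and "end" as two independent structured events that share a span id, so a backend can pair them when both arrive and still surface a lone start when the end never comes. All names here are hypothetical, not an OTel API:

```python
import time
import uuid

events = []  # stand-in for an event pipeline / log sink

def span_start(name: str) -> str:
    """Emit a start event immediately; nothing is buffered in-process."""
    span_id = uuid.uuid4().hex
    events.append({"type": "span_start", "span_id": span_id,
                   "name": name, "ts": time.time()})
    return span_id

def span_end(span_id: str) -> None:
    events.append({"type": "span_end", "span_id": span_id, "ts": time.time()})

def unfinished_spans(evts):
    """Spans whose start event arrived but whose end event never did."""
    started = {e["span_id"]: e for e in evts if e["type"] == "span_start"}
    ended = {e["span_id"] for e in evts if e["type"] == "span_end"}
    return [e for sid, e in started.items() if sid not in ended]

a = span_start("db.query")
span_end(a)
span_start("http.request")  # process crashed before ending: still visible
print([e["name"] for e in unfinished_spans(events)])  # → ['http.request']
```

The payoff is crash visibility: a span that never finishes shows up as a dangling start instead of vanishing, which is the behavior the comment is asking OTel tools to present.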
nicbou|1 year ago
I would like to make sure that a few recurring jobs run fine, but by pushing a status instead of polling it.
jauntywundrkind|1 year ago
Whereas tracing instrumentation is fantastically good at showing where response time is being spent, showing what's getting hit. And it comes with no developer cost; the automatic instrumentation runs & does it all.
Ideally you also throw in some additional tags onto the root-entry-span or current span. That takes some effort, but then it's consistently & widely available & visible.
Tracing is very hard to get orgs culturally on board with. And there are some operational challenges, but personally I think you're way over-selling how hard it is... There's a colossal collection of software that serves as a really good export destination, and your team can probably already operate one or two of them quite well.
It does get a lot more complex if you want longer-term storage. Personally I'm pro mixing systems observability with product performance tracking, so yes, you do need to keep some data indefinitely. And that can be hugely problematic: either you juggle storage and querying for infinitely growing data, or you build systems to aggregate & persist the derived metrics you need while getting rid of the base data.
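The second option, aggregate-then-discard, can be sketched as a simple rollup job: collapse each window of raw span durations into summary statistics that are kept forever, then delete the raw rows. The schema and percentile choices here are made up:

```python
from statistics import quantiles

def rollup(raw_durations_ms: list[float]) -> dict:
    """Collapse one window of raw span durations into keepable summary stats."""
    cuts = quantiles(raw_durations_ms, n=100)  # 99 cut points -> percentiles
    return {
        "count": len(raw_durations_ms),
        "p50_ms": cuts[49],
        "p99_ms": cuts[98],
        "max_ms": max(raw_durations_ms),
    }

# One window of raw data in, a handful of numbers out;
# the raw list can now be deleted from hot storage.
raw = [12.0, 15.0, 14.0, 200.0, 13.0, 16.0, 12.5, 14.5]
summary = rollup(raw)
print(summary["count"], summary["max_ms"])  # → 8 200.0
```

The trade-off the comment points at is baked in: storage stops growing with traffic, but any question you didn't pre-aggregate for (a new percentile, a new grouping) can't be answered once the base data is gone.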
But I just can't emphasize enough how bad most orgs are at logs, and how it's not worth anyone's time to invest in something manual like that, which offers so much less than the alternative (traces).
spimmy|1 year ago
also big agree that most places are so, so, so messy and bad about doing logs. :( for years, i refused to even use the term "logs" because all the assumptions i wanted people to make were the opposite of the assumptions people bring to logs: unstructured, messy, spray-and-pray, etc.