rtuin | 1 year ago

Otel seems complicated because different observability vendors make implementing observability super easy with their proprietary SDKs, agents and APIs. This is what Otel wants to solve, and I think the people behind it are doing a great job. Also, kudos to Grafana for adopting OpenTelemetry as a first-class citizen of their ecosystem.

I’ve been pushing the use of Datadog for years, but their pricing is out of control for anyone between mid-size companies and large enterprises. So as the years passed and the OpenTelemetry APIs and SDKs stabilized, it became our standard for application observability.

To be honest, the documentation could be better overall, and the onboarding docs differ per programming language, which is not ideal.

My current team is on a Node.js/TypeScript stack, and we’ve created a set of packages and an example Grafana stack to get started with OpenTelemetry real quick. Maybe it’s useful to someone here: https://github.com/zonneplan/open-telemetry-js

saurik|1 year ago

> Otel seems complicated because different observability vendors make implementing observability super easy with their proprietary SDKs, agents and APIs. This is what Otel wants to solve, and I think the people behind it are doing a great job.

Wait... so, the problem is that everyone makes it super easy, and so this product solves that by being complicated? ;P

to11mtm|1 year ago

The problem is that they make it super easy in very hacky ways and it becomes painful to improve things without startup money.

Also, per the hackiness, it tends to have a visible perf impact. I know that with the Dynatrace agent we had 0-1ms metrics jump to 5-10ms (this service had a lot of traffic, so it added up), and I'm pretty sure on the .NET side there are issues around the general performance of OTEL. I also know some of the work/'fun' colleagues have had to endure to make OTEL performant for their libs, in spite of the fact that it was a message-passing framework where that should be fairly simple...

to11mtm|1 year ago

> I’ve been pushing the use of Datadog for years, but their pricing is out of control for anyone between mid-size companies and large enterprises

Not a fan of Datadog vs just good metric collection. OTOH, I do see the value of OTEL vs what I prefer to do... in theory.

My biggest problem with all of the APM vendors: once you have kernel hooks via your magical agent, all sorts of fun things come up that developers can't explain.

My favorite example: At another shop we eventually adopted Dynatrace. Thankfully our app already had enough built-in metrics that a lead SRE considered it a 'model' for how to do instrumentation... I say that because, as soon as Dynatrace agents got installed on the app hosts, we started having various 'heisenbugs' requiring node restarts as well as a directly measured drop in performance. [0]

Ironically, the metrics saved us from grief, yet nobody had an idea how to fix it. ;_;

[0] - Curiously, the 'worst' one was MSSQL failovers on update somehow polluting our ADO.NET connection pools in a bad way...

richbell|1 year ago

> I say that because, as soon as Dynatrace agents got installed on the app hosts, we started having various 'heisenbugs' requiring node restarts

Our containers regularly fail due to vague LD_PRELOAD errors. Nobody has invested the time to figure out what the issue is because it usually goes away after restarting; the issue is intermittent and non-blocking, yet constant.

It's miserable.

EdwardDiego|1 year ago

Thank you! I'm very interested in that.