rtuin | 1 year ago
I’ve been pushing the use of Datadog for years, but their pricing is out of control for anyone between mid-size companies and large enterprises. So as the years passed and the OpenTelemetry APIs and SDKs stabilized, it became our standard for application observability.
To be honest, the documentation could be better overall, and the onboarding docs differ per programming language, which is not ideal.
My current team is on a Node.js/TypeScript stack and we’ve created a set of packages and an example Grafana stack to get started with OpenTelemetry real quick. Maybe it’s useful to anyone here: https://github.com/zonneplan/open-telemetry-js
saurik|1 year ago
Wait... so, the problem is that everyone makes it super easy, and so this product solves that by being complicated? ;P
to11mtm|1 year ago
Also, per the hackiness, it tends to have a visible perf impact. I know that with the Dynatrace agent we had 0-1 ms metrics jump to 5-10 ms (this service had a lot of traffic, so it added up), and I'm pretty sure on the .NET side there are issues around the general performance of OTEL. I also know some of the work/'fun' colleagues have had to endure to make OTEL performant for their libs, in spite of the fact that it was a message passing framework where that should be fairly simple...
to11mtm|1 year ago
I'm not a fan of Datadog vs just good metric collection. OTOH, I do see the value of OTEL vs what I prefer to do... in theory.
My biggest problem with all of the APM vendors: once you have kernel hooks via your magical agent, all sorts of fun things come up that developers can't explain.
My favorite example: At another shop we eventually adopted Dynatrace. Thankfully our app already had enough built-in metrics that a lead SRE considered it a 'model' for how to do instrumentation... I say that because, as soon as Dynatrace agents got installed on the app hosts, we started having various 'heisenbugs' requiring node restarts as well as a directly measured drop in performance. [0]
Ironically, the metrics saved us from grief, yet nobody had an idea how to fix it. ;_;
[0] - Curiously, the 'worst' one was MSSQL failovers on update somehow polluting our ADO.NET connection pools in a bad way...
richbell|1 year ago
Our containers regularly fail due to vague LD_PRELOAD errors. Nobody has invested the time to figure out what the issue is because it usually goes away after restarting; the issue is intermittent and non-blocking, yet constant.
It's miserable.
EdwardDiego|1 year ago