The problem with OpenTelemetry

[+] codereflection|1 year ago|reply

I understand what the author is saying, but vendor lock-in with closed-source observability platforms is a significant challenge, especially for large organizations. When you instrument hundreds or thousands of applications with a specific tool, like the Datadog Agent, disentangling from that tool becomes nearly impossible without a massive investment of engineering time. In the Platform Engineering professional services space, we see this problem frequently. Enterprises are growing tired of big observability platform lock-in, especially when it comes to Datadog's opaque nature of your spend on their products, for example.

One of the promises of OTEL is that it allows organizations to replace vendor-specific agents with OTEL collectors, allowing the flexibility of the end observability platform. When used with an observability pipeline (such as EdgeDelta or Cribl), you can re-process collected telemetry data and send it to another platform, like Splunk, if needed. Consequently, switching from one observability platform to another becomes a bit less of a headache. Ironically, even Splunk recognizes this and has put substantial support behind the OTEL standard.

OTEL is far from perfect, and maybe some of these goals are a bit lofty, but I can say that many large organizations are adopting OTEL for these reasons.

[+] zeeg|1 year ago|reply

I totally agree I just wish we could do it in a way that doesn’t try to lump every problem into the same bucket. I don’t see what it achieves personally, and I think it’s limiting the ability for the original goals of the project to be as successful as they could be.

[+] andrewmcwatters|1 year ago|reply

Yeah, it's the primary reason we used it. If OpenTelemetry's raison d'être was simply to give Datadog a reason to not bullshit their customers on pricing, it would fulfill a major need in platform services.

[+] doctorpangloss|1 year ago|reply

I don’t know what the Sentry guy is really saying - I mean you can write whatever code you want, go for it man.

But I do have to “pip uninstall sentry-sdk” in my Dockerfile because it clashes with something I didn’t author. And anyway, because it is completely open source, the flaws in OpenTelemetry for my particular use case took an hour to surmount, and vitally, I didn’t have to pay the brain damage cost most developers hate: relationships with yet another vendor.

That said I appreciate all the innovation in this space, from both Sentry and OpenTelemetry. The metrics will become the standard, and that’s great.

The problem with Not OpenTelemetry: eventually everyone is going to learn how to use Kubernetes, and the USP of many startup offerings will vanish. OpenTelemetry and its feature scope creep make perfect sense for people who know Kubernetes. Then it makes sense why you have a wire protocol, why abstraction for vendors is redundant or meaningless toil, and why PostHog and others stop supporting Kubernetes: it competes with their paid offering.

[+] marcosdumay|1 year ago|reply

> eventually everyone is going to learn how to use Kubernetes

That seems obviously true... yet, there are so many people out there that seem unable to learn it that I don't think it's a reliable prediction.

[+] zitterbewegung|1 year ago|reply

Why have I heard only bad things on k8s? To the point where it’s a meme to understand k8s…

[+] MapleWalnut|1 year ago|reply

The Sentry SDK is open source and easy to contribute to in my experience.

[+] ankitnayan|1 year ago|reply

I think all of us agree that OpenTelemetry's end-goal of making Observability vendor neutral is futuristic and inevitable. We can complain about it being hard to get started, bloated, etc but the value it provides is clear, esp, when you are paying $$$ to a vendor and stuck with it.

OpenStandards also open up a lot of usecases and startups too. SigNoz, TraceTest, TraceLoop, Signadot, all are very interesting projects which OpenTelemetry enabled.

The majority of the problem seems like sentry is not able to provide it's sentry like features by adopting otel. Getting involved at the design phase could have helped shaped the project that could have considered your usecases. The maintainers have never been opposed to such contributions AFAIK.

Regarding, limiting otel just to tracing would not be sufficient today as the teams want a single platform for all observability rather than different tools for different signals.

I have seen hundreds of companies switch to opentelemetry and save costs by being able to choose the best vendor supporting their usecases.

lack of docs, learning curve, etc are just temporary things that can happen with any big project and should be fixed. Also, otel maintainers and teams have always been seeking help in improving docs, showcasing usecases, etc. If everyone cares enough for the bigger picture, the community and existing vendors should get more involved in improving things rather than just complaining.

[+] no_circuit|1 year ago|reply

IMO this boils down how one gets paid to understand or misunderstand something. A telemetry provider/founder is being commoditized by an open specification in which they do not participate in its development -- implied by the post saying the author doesn't know anyone on the spec committee(s). No surprise here.

Of course implementing a spec from the provider point of view can be difficult. And also take a look at all the names of the OTEL community and notice that Sentry is not there: https://github.com/open-telemetry/community/blob/86941073816.... This really isn't news. I'd guess that a Sentry customer should just be able to use the OTEL API and could just configure a proprietary Sentry exporter, for all their compute nodes, if Sentry has some superior way of collecting and managing telemetry.

IMO most library authors do not have to worry about annotation naming or anything like that mentioned in the post. Just use the OTEL API for logs, or use a logging API where there is an OTEL exporter, and whomever is integrating your code will take care of annotating spans. Propagating span IDs is the job of "RPC" libraries, not general code authors. Your URL fetch library should know how to propagate the Span ID provided that it also uses the OTEL API.

It is the same as using something like Docker containers on a serverless platform. You really don't need to know that your code is actually being deployed in Kubernetes. Use the common Docker interface is what matters.

[+] chipdart|1 year ago|reply

> IMO this boils down how one gets paid to understand or misunderstand something.

I completely agree. The most charitable interpretation of this blog post is that the blogger genuinely fails go understand the basics of the problem domain, or worst case scenario they are trying to shitpost away the need for features that are well supported by a community-driven standard like OpenTelemetry.

[+] zeeg|1 year ago|reply

Author here.

Y’all realize we’d just make more money if everyone has better instrumentation and we could spend less time on it, and more time on the product, right?

There is no conspiracy. It’s simple math and reasoning. We don’t compete with most otel consumers.

I don’t know how you could read what I posted and think sentry believes otel is a threat, let alone from the fact that we just migrated our JS SDK to run off it.

[+] serverlessmom|1 year ago|reply

I think that a number of Observability providers are looking at how they can add features and value to parts of monitoring that OTel effectively commoditizes. Thinking of the tail-based sampling implemented at Honeycomb for APM, or synthetic monitoring by my own team at Checkly.

"In 2015 Armin and I built a spec for Distributed Tracing. Its not a hard problem, it just requires an immense amount of coordination and effort." This to me feels like a nice glass of orange juice after brushing my teeth. The spec on DT is very easy, but the implementation is very very hard. The fact that OTel has nurtured a vast array of libraries to aid in context propagation is a huge acheivement, and saying 'This would all work fine if everyone everywhere adopted Sentry' is... laughable.

Totally outside the O11y space, OTel context propagation is an intensely useful feature because of how widespread it is. See Signadot implementing their smart test routing with OpenTelemetry: https://www.signadot.com/blog/scaling-environments-with-open...

[+] serverlessmom|1 year ago|reply

An argument that OpenTelemetry is somehow 'too big' is an example of motivated reasoning. I can understand that A Guy Who Makes Money If You Use Sentry dislikes that people are using OTel libraries to solve similar problems.

Context propagation and distributed tracing are cool OTel features! But they are not the only thing OTel should be doing. OpenTelemetry instrumentation libraries can do a lot on their own, a friend of mine made massive savings in compute efficiency with the NodeJS OTel library: https://www.checklyhq.com/blog/coralogix-and-opentelemetry-o...

[+] zeeg|1 year ago|reply

Author here.

OpenTelemetry is not competitive to us (it doesn’t do what we do in plurality), and we specifically want to see the open tracing goals succeed.

I was pretty clear about that in the post though.

[+] wdb|1 year ago|reply

Personally, I like OpenTelemetry, nice standardised approach. I just wished the vendors would have better support for the semantic conventions defined for a wide variety of traces.

I quite like the idea of only need to change one small piece of the code to switch otel exporters instead of swapping out a vendor trace sdk.

My main gripe with OpenTelemetry I don't fully understand what the exact difference is between (trace) events and log records.

[+] yunwal|1 year ago|reply

> My main gripe with OpenTelemetry I don't fully understand what the exact difference is between (trace) events and log records.

This is my main gripe too. I don't understand why {traces, logs, metrics} are not just different abstractions built on top of "events" (blobs of data your application ships off to some set of central locations). I don't understand why the opentelemetry collector forces me to re-implement the same settings for all of them and import separate libraries that all seem to do the same thing by default. Besides sdks and processors, I don't understand the need for these abstractions to persist throughout the pipeline. I'm running one collector, so why do I need to specify where my collector endpoint is 3 different times? Why do I need to specify that I want my blobs batched 3 different times? What's the point of having opentelemetry be one project at all?

My guess is this is just because opentelemetry started as a tracing project, and then became a logs and metrics project later. If it had started as a logging project, things would probably make more sense.

[+] mikeshi42|1 year ago|reply

It's a bit confusing but here's my best attempt to explain it:

- Trace events (span events) are intended to be structured events and possibly can have semantic attributes behind them - similar to how spans have semantic attributes. They're great if your team is all bought in on tracing as an organization. They will colocate your span events with your parent span. In practice they have poor searchability/indexing in many tools, so they should only be used if you only intend to use them when you will discover the span first. (Ex. debug info that is only useful to figure out why a span was very slow and you're okay not being easily searchable)

- Log records are plain old logs, they should be structured, but don't have to be, and there isn't a high expectation of structured data, much less semantic attributes. Logs can be easily adopted without buying into tracing.

- Events API, this is an experimental part of Otel, but is intended to be an API that emits logs with the expectation of semantic conventions (and therefore is also structured). Afaik end users are not the intended audience of this API.

Many teams fall along the spectrum of logs vs tracing which is why there's options to do things multiple ways. My personal take is that log records are going to continue to be more flexible than span events as an end-user given the state of current tools.

Disclaimer: I help build hyperdx, we're oss, otel-based observability and we've made product decisions based on the above opinions.

[+] tnolet|1 year ago|reply

Can you give an example of the missing semantic conventions?

[+] AndreasBackx|1 year ago|reply

I have been trying to find an equivalent for `tracing` first in Python and this week in TypeScript/JavaScript. At my work I created an internal post called "Better Python Logging? Tracing for Python?" that basically asks this question. OpenTelemetry was also what I looked at and since I have looked at other tooling.

It is hard to explain how convenient `tracing` is in Rust and why I sorely miss it elsewhere. The simple part of adding context to logs can be solved in a myriad of ways, yet all boil down to a similar "span-like" approach. I'm very interested in helping bring what `tracing` offers to other programming communities.

It very likely is worth having some people from the space involved, possibly from the tracing crate itself.

[+] zeeg|1 year ago|reply

We’ll fund solving this as long as the committees agree with the goal. We just want standard tracing implementations.

(Speaking on behalf of Sentry)

[+] wvh|1 year ago|reply

I have surveyed this landscape for a number of years, though I'm not involved enough to have strong opinions. We're running a lot of Prometheus ecosystem and even some OpenTelemetry stacks across customers. OpenTelemetry does seem like one of these projects with an ever expanding scope. It makes it hard to integrate parts you like and keep things both computing-wise and mentally lightweight without having to go all-in.

It's not anymore about hey, we'll include this little library or protocol instead of rolling our own, so we can hope to be compatible with a bunch of other industry-standard software. It's a large stack with an ever evolving spec. You have to develop your applications and infrastructure around it. It's very seductive to roll your own simpler solution.

I appreciate it's not easy to build industry-wide consensus across vendors, platforms and programming languages. But be careful with projects that fail to capture developer mindshare.

[+] pdimitar|1 year ago|reply

Could you clarify further on your reservations, please? As a programmer I appreciate only including a library in my project, give it a set OTLP settings (host, port, URI) and move on.

What difficulties did opting into OTel give you?

[+] fractalwrench|1 year ago|reply

The main interest I've seen in OTel from Android engineers has been driven by concerns around vendor lock-in. Backend/devops in their organisations are typically using OTel tooling already & want to see all telemetry in one place.

From this perspective it doesn't matter if the OTel SDK comes bundled with a bunch of unnecessary code or version conflicts as is suggested in the article. The whole point is to regain control over telemetry & avoid paying $$$ to an ambivalent vendor.

FWIW, I don't think the OTel implementation for mobile is perfect - a lot of the code was originally written with backend JVM apps in mind & that can cause friction. However, I'm fairly optimistic those pain points will get fixed as more folks converge on this standard.

Disclaimer: I work at a Sentry competitor

[+] markl42|1 year ago|reply

At the risk of hijacking the comments, I've been trying to use OTel recently to debug performance of a complex webpage with lots of async sibling spans, and finding it very very difficult to identify the critical path / bottlenecks.

There's no causal relationships between sibling spans. I think in theory "span links" solves this, but afaict this is not a widely used feature in SDKs are UI viewers.

(I wrote about this here https://github.com/open-telemetry/opentelemetry-specificatio...)

[+] diurnalist|1 year ago|reply

I don't believe this is a solved problem, and it's been around since OpenTracing days[0]. I do not think that the Span links, as they are currently defined, would be the best place to do this, but maybe Span links are extended to support this in the future. Right now Span links are mostly used to correlate spans causally _across different traces_ whereas as you point out there are cases where you want correlation _within a trace_.

[0]: https://github.com/opentracing/specification/issues/142

[+] hinkley|1 year ago|reply

I was underwhelmed by the max size for spans before they get rejected. Our app was about an order of magnitude too complex for OTEL to handle.

Reworking our code to support spans made our stack traces harder to read and in the end we turned the whole thing off anyway. Worse than doing nothing.

[+] tnolet|1 year ago|reply

A recent example of OTel confusion.

I could for the life of me not get the Python integration send traces to a collector. Same URL, same setup same API key as for Nodejs and Go.

Turns out the Python SDK expect a URL encoded header, e.g. “Bearer%20somekey” whereas all other SDKs just accept a string with a whitespace.

The whole split between HTTP, protobuf over HTTP and GRPC is also massively confusing.

[+] hinkley|1 year ago|reply

The silent failure policy of OTEL makes flames shoot out of the top of my head.

We had to use wireshark to identify a super nasty bug in the “JavaScript” (but actually typescript despite being called opentelemetryjs) implementation.

And OTEL is largely unsuitable for short lived processes like CLIs, CI/CD. And I would wager the same holds for FaaS (Lambda).

In the end I prefer the network topology of StatsD, which is what we were migrating from. Let the collector do ALL of the bookkeeping instead of faffing about. OTEL is actively hostile to process-per-thread programming languages. If I had it to do over again I’d look at the StatsD->Prometheus integrations, and the StatsD extensions that support tagging.

[+] hahn-kev|1 year ago|reply

Sounds like a problem with the Python sdk

[+] NeutralForest|1 year ago|reply

It resonates. As an intern I had to add OTEL to a Python project and I had to spend a lot of time in the docs to understand the concepts and implementation. Also, the Python impl has a lot of global state that makes it hard to use properly imo.

[+] chipdart|1 year ago|reply

> As an intern I had to ${DO_SOME_PROJECT} and I had to spend a lot of time in the docs to understand the concepts and implementation

That sounds like every single run-of-the-mill internship.

[+] zaphar|1 year ago|reply

Tracing requires keeping mappings for tracing identifiers per request. I don't know you do that without global state unless you want the tracing identifiers to pollute your own internal apis everywhere.

[+] BiteCode_dev|1 year ago|reply

100% agree.

Every time I tried to use OT I was reading the doc and whispering "but, why? I only need...".

[+] Karrot_Kream|1 year ago|reply

Yeah I was going down this path for a side project I was getting going and spent a couple days of after-work time exploring how to get just some basic traces in OT and realized it was much more than I needed or cared about.

[+] spullara|1 year ago|reply

There is a huge whole in using spans as they are specified. Without separating the start of a span from the end of a span you can never see things that never complete, fail hard enough to not close the span, or travel through queues. This is a compromise they made because typical storage systems for tracing aren't really good enough to stitch them all back together quickly. Everyone should be sending events and stitching it all together to create the view. But instead we get a least common denominator solution.

[+] drewbug01|1 year ago|reply

As a contributor to (and consumer of) OpenTelemetry, I think critique and feedback is most welcome - and sorely needed.

But this ain’t it. In the opening paragraphs the author dismisses the hardest parts of the problem (presumably because they are human problems, which engineers tend to ignore), and betrays a complete lack of interest in understanding why things ended up this way. It also seems they’ve completely misunderstood the API/SDK split in its entirety - because they argue for having such a split. It’s there - that’s exactly what exists!

And it goes on and on. I think it’s fair to critique OpenTelemetry; it can be really confusing. The blog post is evidence of that, certainly. But really it just reads like someone who got frustrated that they didn’t understand how something worked - and so instead of figuring it out, they’ve decided that it’s just hot garbage. I wish I could say this was unusual amongst engineers, but it isn’t.

[+] zeeg|1 year ago|reply

Author here.

That’s kind of making my point for me fwiw. It’s too complicated. I consider myself a product person so this is my version of that lens on the problem.

I’m not dismissing the people problem at all - I actually am trying to suggest the technology problem is the easier part (eg a basic spec). Getting it implemented, making it easy to understand, etc is where I see it struggling right now.

Aside this is not just my feedback, it’s a synthesis of what I’m hearing (but also what I believe).

[+] klabb3|1 year ago|reply

No dog in the fight here, but… you're saying that one of the top guys at a major observability shop didn’t understand Open Telemetry, then that’s saying much more about OT than it does about his skills or efforts to understand. After all, his main point is that it’s complex and overengineered, which is the key takeaway for curious bystanders like me, whether every detail is technically correct or not.

> it just reads like someone who […] didn’t understand how something worked - and so instead of figuring it out, they’ve decided that it’s just hot garbage.

And what about average developers asked to “add telemetry” to their apps and libraries? Their patience will be much lower than that.

Not necessarily defending the content (frankly it should have had more examples), but I relate to the sentiment. As a developer, I need framework providers to make sane design decisions with minimal api surface, otherwise I’d rather build something bespoke or just not care.

[+] arccy|1 year ago|reply

indeed, it just sounds like they're complaining they don't have a seat at the table...

[+] shaqbert|1 year ago|reply

Otel is indeed quite complex. And the docs are not meant for quick wins...

Otelbin [0] has helped me quite a bit in configuring and making sense of it, and getting stuff done.

[0]: https://www.otelbin.io/

[+] wdb|1 year ago|reply

That looks pretty cool! OpenTelemetry Collector configuration files are pretty confusing. Do like the collector, though. Makes it easy to sent a subset of your telemetry to trusted partners.

[+] epgui|1 year ago|reply

Anyone else finding this very difficult to read? I’d really recommend feeding this through a grammar checker, because poor grammar betrays unclear thinking.

[+] zeeg|1 year ago|reply

So you’re saying it makes my thinking more clear? :)

This is what happens when you use a tool designed for authoring code to also author content.

[+] grenbys|1 year ago|reply

I think there are two separate perspectives. For developers Open Telemetry is a clear win - high-quality vendor agnostic instrumentation backed by a reputable orgs. I instrumented with traces many business critical repos at my company (major customer support SaaS) with OTEL in Ruby, Python, JS. Not once was I confused/blocked/distracted by the presence of logs/metrics in the spec. However, can’t say much from the observability vendor perspective trying to be fully compatible with OTEL spec including metrics/logs. Article mentions customers having issues with using tracing instrumentation - it would’ve been great to back this up with corresponding github issues explaining the problems. Based on the presented JS snippet (just my guess) maybe the issue is with async code where the “span.operation” span gets immediately closed w/o waiting for the doTheThing()? Yeah - that’s tricky in JS given its async primitives. We ended up just maintaining a global reference to the currently active span and patching some OTEL packages to respect that. FWIW Sentry JS instrumentation IS really good and practical. Would have been great if Sentry could donate/contribute/influence to OTEL JS SIG with specific improvements - would be a win-win. As much as I hate DataCanine pricing they did effectively donated their Ruby tracing instrumentation to OTEL which I think is one of the best ones out there.

[+] hobofan|1 year ago|reply

This seems to be more of a branding problem than anything.

OP (rightfully) complains that there is a mismatch between what they (can) advertise ("We support OTEL") and what they are actually providing to the user. I have the same pain point from the consumer side, where I have to trial multiple tools and service to figure out which of them actually supports the OTEL feature set I care about.

I feel like this could be solved by introducing better branding that has a clearly defined scope of features inside the project (like e.g. "OTEL Tracing") which can serve as a direct signifier to customers about what feature set can be expected.

[+] zeeg|1 year ago|reply

Yes! Its a bit deeper than that but its fundamentally a packaging issue.

[+] antonyt|1 year ago|reply

OTel is flawed for sure, but I don't understand the stance against metrics and logs. Traces are inherently sampled unless you're lighting all your money on fire, or operating at so small a scale that these decisions have no real impact. There are kinds of metrics and logs which you always want to emit because they're mission-critical in some way. Is this a Sentry-specific thing? Does it just collapse these three kinds of information into a single thing called a "trace"?

[+] Dextro|1 year ago|reply

I mean, when you're the one selling the gas to light that money on fire you have a vested interest in keeping it that way right?

I do agree that logging and spans are very similar, but I disagree that logs are just spans because they aren't exactly the same.

I also agree that you can collect all metrics from spans and, in fact, it might be a better way to tackle it. But it's just not feasible to do so monetarily so you do need to have some sort of collection step closer to the metric producers.

What I do agree with is that the terminology and the implementation of OTEL's SDK is incredibly confusing and hard to implement/keep up to date. I spent way too many hours of my career struggling with conflicting versions of OTEL so I know the pain and I desperately wish they would at least take to heart the idea of separating implementation from API.

[+] the_mitsuhiko|1 year ago|reply

> Traces are inherently sampled unless you're lighting all your money on fire

You can burn a lot of money with logs and metrics too. The question is how much value you get for the money you throw on the burning pile of monitoring. My personal belief is that well instrumented distributed tracing is more actionable than logs and metrics. Even if sampled.

(Disclaimer: I work at sentry)

[+] aleph_minus_one|1 year ago|reply

> OTel is flawed for sure, but I don't understand the stance against metrics and logs.

Even if you don't want to consider the privacy concerns: telemetry wastes quite some data of your internet connection.

174 comments