The whole time I was learning/porting to Otel I felt like I was back in the Java world again. Every time I stepped through the code it felt like EnterpriseFizzBuzz. No discoverability. At all. And their own jargon that looks like it was made by people high on something.
And in NodeJS, about four times the CPU usage of StatsD. We ended up doing our own aggregation to tamp this down and to reduce tag proliferation (StatsD is fine having multiple processes report the same tags; OTEL clobbers them). At peak load we had 1 CPU running at 60-80% utilization. Until something changes we can't vertically scale. Other factors on that project mean that's now unlikely to happen, but it grates.
OTEL is actively hostile to any language that uses one process per core. What a joke.
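The "doing our own aggregation" fix mentioned above can be sketched roughly like this: buffer counter increments in-process and export one aggregated point per series per interval, instead of one export per increment. Plain Python with illustrative names, no OTel SDK involved:

```python
from collections import defaultdict

class PreAggregator:
    """Buffers counter increments in-process and flushes one
    aggregated data point per (name, tags) series per interval,
    instead of exporting every individual increment."""

    def __init__(self):
        self._buffer = defaultdict(float)  # (name, tags) -> running delta

    def add(self, name, value=1.0, tags=()):
        # Sort tag pairs so the same tag set always maps to the same series.
        self._buffer[(name, tuple(sorted(tags)))] += value

    def flush(self):
        # One export call per series instead of one per increment.
        points, self._buffer = dict(self._buffer), defaultdict(float)
        return points

agg = PreAggregator()
for _ in range(1000):
    agg.add("http.requests", tags=(("route", "/users"),))
agg.add("http.requests", tags=(("route", "/health"),))
print(agg.flush())
```

A real setup would call `flush()` on a timer and hand the result to whatever exporter is in use; the point is that the exporter sees two data points here, not 1001.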
Just go with Prometheus. It’s not like there are other contenders out there.
I'm fairly convinced that OTEL is in a form of 'vendor capture', i.e. because the only way to get a standard was to compromise with various bigcorps and sloppy startups to glue-gun it all together.
I tried doing a simple OTel setup in .NET and, after a few hours of trying to grok the documentation of the vendor my org has chosen, hopped into a Discord run by a colleague whose business model is partly 'pay for the good OTel on top of the OSS product', and immediately concluded that whatever it cost, it was worth the money.
I'd rather build another reliable event/pubsub library without prior experience than try to implement OTEL.
> It’s not like there are other contenders out there.
Apache Skywalking might be worth a look in some circumstances: it doesn't eat too many resources, is fairly straightforward to set up and run, and while admittedly somewhat janky (not the most polished UI or docs), it works okay: https://skywalking.apache.org/
Also I quite liked that a minimal setup is indeed pretty minimal: a web UI, a server instance and a DB that you already know https://skywalking.apache.org/docs/main/latest/en/setup/back...
In some ways, it's a lot like Zabbix in the monitoring space - neither will necessarily impress anyone, but both have a nice amount of utility.
This matches my conclusion as well. Just use Prometheus and whatever client library for your language of choice, it's 1000x simpler than the OTEL story.
Using OTel from the C++ side... To have cumulative metrics from multiple applications (i.e. not "statsd/delta") I create a relatively low-cardinality process.vpid integer (and somehow coordinate this number to be unique as long as the app emitting it is still alive) - you can use some global object to coordinate it.
Then you can have something that sums, and removes the attribute.
With statsd/delta, if you lose a signal then all subsequent data gets skewed; with cumulative metrics you only lose precision.
edit... forgot to say - my use case is "push based" metrics as these are coming from "batch" tools, not long running processes that can be scraped.
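The collector-side half of the scheme above can be sketched as: sum the per-process cumulative series and drop the process.vpid attribute. A plain-Python illustration (data layout and names are made up for the sketch):

```python
def sum_over_attribute(series, drop_key="process.vpid"):
    """Collapse per-process cumulative series into one series by
    dropping `drop_key` from each attribute set and summing values.

    `series` maps frozenset-of-(key, value)-pairs -> cumulative value."""
    out = {}
    for attrs, value in series.items():
        kept = frozenset((k, v) for k, v in attrs if k != drop_key)
        out[kept] = out.get(kept, 0.0) + value
    return out

# Two worker processes reporting the same cumulative counter:
series = {
    frozenset({("metric", "jobs.done"), ("process.vpid", 0)}): 120.0,
    frozenset({("metric", "jobs.done"), ("process.vpid", 1)}): 80.0,
}
print(sum_over_attribute(series))
```

In a real pipeline this aggregation would live in the collector (e.g. a transform/aggregation processor) rather than application code; the sketch just shows why the extra attribute keeps the per-process series from clobbering each other.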
Otel seems complicated because different observability vendors make implementing observability super easy with their proprietary SDK’s, agents and API’s. This is what Otel wants to solve and I think the people behind it are doing a great job.
Also kudos to grafana for adopting OpenTelemetry as a first class citizen of their ecosystem.
I’ve been pushing the use of Datadog for years but their pricing is out of control for anyone between a mid-size company and a large enterprise. So as years passed and the OpenTelemetry APIs and SDKs stabilized, it became our standard for application observability.
To be honest the documentation could be better overall and the onboarding docs differ per programming language, which is not ideal.
My current team is on a NodeJS/Typescript stack and we’ve created a set of packages and an example Grafana stack to get started with OpenTelemetry real quick.
Maybe it’s useful to anyone here: https://github.com/zonneplan/open-telemetry-js
> Otel seems complicated because different observability vendors make implementing observability super easy with their proprietary SDK’s, agents and API’s. This is what Otel wants to solve and I think the people behind it are doing a great job.
Wait... so, the problem is that everyone makes it super easy, and so this product solves that by being complicated? ;P
> I’ve been pushing the use of Datadog for years but their pricing is out of control for anyone between mid size company and large enterprises
Not a fan of datadog vs just good metric collection. OTOH while I see the value of OTEL vs what I prefer to do... in theory.
My biggest problem with all of the APM vendors, once you have kernel hooks via your magical agent all sorts of fun things come up that developers can't explain.
My favorite example: At another shop we eventually adopted Dynatrace. Thankfully our app already had enough built-in metrics that a lead SRE considered it a 'model' for how to do instrumentation... I say that because, as soon as Dynatrace agents got installed on the app hosts, we started having various 'heisenbugs' requiring node restarts as well as a directly measured drop in performance. [0]
Ironically, the metrics saved us from grief, yet nobody had an idea how to fix it. ;_;
[0] - Curiously, the 'worst' one was MSSQL failovers on update somehow polluting our ADO.NET connection pools in a bad way...
It is as complicated as you want or need it to be. You can avoid any magic and stick to a subset that is easy to reason about and brings the most value in your context.
For our team, it is very simple:
* we use a library to send traces and traces only[0]. They bring the most value for observing applications and can contain all the data the other signal types can contain. Basically hash-maps vs strings and floats.
* we use manual instrumentation as opposed to automatic - we are deliberate in what we observe and have a great understanding of what emits the spans. We have naming conventions that match our code organization.
* we use two different backends - an affordable 3rd-party service and an all-in-one Jaeger install (just run 1 executable or docker container) that doesn't save the spans to disk, for local development. The second is mostly for peace of mind, so team members know they are not going to flood the third-party service.
[0] We have a previous setup to monitor infrastructure and in our case we don't see a lot of value of ingesting all the infrastructure logs and metrics. I think it is early days for OTEL metrics and logs, but the vendors don't tell you this.
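The "naming conventions that match our code organization" idea can be sketched with a tiny decorator that derives span names from module and function. The recorder list below is a stand-in for a real tracer (in opentelemetry-python the wrapped call would be something like `tracer.start_as_current_span(span_name)`, but this sketch avoids the dependency):

```python
import functools

RECORDED = []  # stand-in for a real span exporter

def traced(fn):
    """Derive the span name from code organization (module.qualname),
    so span names always line up with where the code lives."""
    span_name = f"{fn.__module__}.{fn.__qualname__}"

    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        # Real code would open a span here instead of appending to a list.
        RECORDED.append(span_name)
        return fn(*args, **kwargs)
    return wrapper

@traced
def submit_order(order_id):
    return f"submitted {order_id}"

print(submit_order(42), RECORDED)
```

Manual instrumentation then becomes a one-line decorator at each boundary you care about, and the naming convention is enforced by construction rather than by review.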
I did not find that manual instrumentation made things simpler. You’re trading a learning curve that now starts way before you can demonstrate results for a clearer understanding of the performance penalties of using this Rube Goldberg machine.
Otel may be okay for a green field project but turning this thing on in a production service that already had telemetry felt like replacing a tire on a moving vehicle.
... if (and only if) all the libraries you use also stick to that subset, yea. That is overwhelmingly not true in my experience. And the article shows a nice concrete example of why.
For green-field projects which use nothing but otel and no non-otel frameworks, yea. I can believe it's nice. But I definitely do not live in that world yet.
One of my biggest problems was the local development story. I wanted logs, traces and metrics support locally but didn’t want to spin up a multitude of Docker images just to get that to work. I wanted to be able to check what my metrics, traces, baggage and activity spans look like before I deploy.
Recently, the .NET team launched .NET Aspire and it’s awesome. Super easy to visualize everything in one place in my local development stack and it acts as an orchestrator as code.
Then when we deploy to k8s we just point the OTEL endpoint at the DataDog Agent and everything just works.
We just avoid the DataDog custom trace libraries and SDK and stick with OTEL. Now it’s a really nice development experience.
https://learn.microsoft.com/en-us/dotnet/aspire/fundamentals...
https://docs.datadoghq.com/opentelemetry/#overview
Takes 5 minutes to set it up locally on your dev machine the first time, from then on you can just have a separate terminal tab where you simply run `/path/to/openobserve` and that's it. They also offer a Docker image for local and remote running as well, if you don't want to have the grand complexity of a single statically-linked binary. :P
It's an all-in-one fully compliant OpenTelemetry backend with pretty graphs. I love it for my projects, hasn't failed me in any detectable way yet.
I'm not convinced by .NET Aspire. It solves a small problem (service discovery and orchestration for local development of multi service projects). But it solves this by making service discovery and orchestration an application level concern. With Aspire you needlessly add complexity at the app level and get locked into a narrow ecosystem. There are many proven alternatives like docker compose for local development. Aspire is not even that much if at all easier than using docker compose and env vars.
If you are doing otel with python, use Logfire's client... even if you don't use their offering.
It's FOSS, and you can point it to any OTel-compatible endpoint. Plus the client the Pydantic team made is 10 times better and simpler than the official OTel lib.
Samuel Colvin has a cool interview where he explains how he got there: https://www.bitecode.dev/p/samuel-colvin-on-logfire-mixing-p...
Definitely can relate, this is why I started an open-source project that focuses on making OpenTelemetry adoption as easy as running a single command line:
https://github.com/odigos-io/odigos
A lot of web frameworks etc do most of the instrumentation for you these days. For instance using opentelemetry-js and self hosting something like https://signoz.io should take less than an hour to get spun up and you get a ton of data without writing any custom code.
Context propagation isn't trivial on a multi-threaded async runtime. There are several ways to do it, but JVM agents that instrument bytecode are popular because they work transparently.
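Python's `contextvars` shows the affordance being described: context flows implicitly across concurrent tasks without bytecode instrumentation (opentelemetry-python builds its context propagation on this mechanism; the snippet below is a stdlib-only illustration, not the OTel API):

```python
import asyncio
import contextvars

# Each task gets a copy of the current context at creation time,
# so concurrent requests cannot clobber each other's trace id.
current_trace_id = contextvars.ContextVar("current_trace_id", default=None)

async def db_call():
    # No trace id is passed explicitly; it rides along in the context.
    return current_trace_id.get()

async def handle_request(trace_id):
    current_trace_id.set(trace_id)
    # Both sub-calls inherit this request's context.
    return await asyncio.gather(db_call(), db_call())

async def main():
    a, b = await asyncio.gather(handle_request("t1"), handle_request("t2"))
    return a, b

print(asyncio.run(main()))
```

JVM agents achieve the same transparency by rewriting bytecode to carry context across thread and executor boundaries; languages without such an affordance (Go's explicit `context.Context`) push the plumbing onto the application author.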
Same thing. OpenTelemetry grew up from Traces, but Metrics and Logs are much better left to specialized solutions.
Feels like a "leaky abstraction" (or "leaky framework") issue. If we wanted to put everything under one umbrella, then well, an SQL database can also do all these things at the same time! Doesn't mean it should.
If you get to the end you find that the pain was all self-inflicted. I found it to be very easy in Python with standard stacks (mysql, flask, redis, requests, etc), because you literally just do a few imports at the top of your service and it automatically hooks itself up to track everything without any fuss.
> I found it to be very easy in Python with standard stacks (mysql, flask, redis, requests, etc), because you literally just do a few imports at the top of your service and it automatically hooks itself up to track everything without any fuss.
Yes, but only if everything in your stack is supported by their auto instrumentation. Take `aiohttp` for example. The latest version is 3.11.X and ... their auto instrumentation claims to support `3.X` [0] but results vary depending on how new your `aiohttp` is versus the auto instrumentation.
It's _magical_ when it all just works, but that ends up being a pretty narrow needle to thread!
[0]: https://github.com/open-telemetry/opentelemetry-python-contr...
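Part of why auto-instrumentation is so version-sensitive: it typically wraps library internals at import time, along these lines (a toy stand-in, not the actual aiohttp instrumentation):

```python
# A toy client library, standing in for something like aiohttp's request path:
class Client:
    def request(self, url):
        return f"GET {url} -> 200"

SPANS = []

def instrument(cls):
    """Wrap the library's entry point the way auto-instrumentation
    packages do. If a new library version renames, splits, or moves
    `request`, the wrapper silently stops matching reality -- which
    is why instrumentation packages pin supported version ranges."""
    original = cls.request

    def wrapped(self, url):
        SPANS.append(("http.client.request", url))  # record a "span"
        return original(self, url)

    cls.request = wrapped

instrument(Client)
print(Client().request("/users"), SPANS)
```

The failure mode is exactly the one described above: the patch still applies cleanly against a newer library, but the internal call path it wraps is no longer the one actually taken.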
So recently I needed to set this up for a very simple Flask app. We're running otel-collector-contrib, jaeger-all-in-one, and Prometheus on a single server with docker compose (it all has to be within the corpo intranet for reasons..)
Traces work, and I have the spanmetrics exporter set up, and I can actually see the spanmetrics in prometheus if I query directly, but they won't show up in the jaeger "monitor" tab, no matter what I do.
I spent 3 days on this before my boss is like "why don't we just manually instrument and send everything to the SQL server and create a grafana dashboard from that" and agh I don't want to do that either.
Any advice? It's literally the simplest usecase but I can't get it to work. Should I just add grafana to the pile?
Not sure about this, I think the vendors were happy with their own proprietary code, agents and backends because the lock-in ensures that switching costs (in terms of writing all new code) are very high.
this is going to come off as being fussy, but 'implement' used to refer to the former activity, not the latter. Which is fine, meanings change; it's just amusing that we no longer have a word for 'sitting down and writing software to match a specification', only for 'taking existing software and deploying it on servers'.
Author is trying to do something difficult with a non-batteries-included open source (free to them) product. Seems quite uncomplicated given the circumstances.
The whole point of OTel is to avoid getting bent over a barrel by one of the SaaS "logging/tracing/telemetry" companies, and as such it's going to incur some cost/pain of its own, but typically the bargain is worth taking.
I agree. I tried to get it to work recently with Datadog, but there were so many hiccups that I ended up having to use Datadog's solution for the most part. The documentation across everything is also kind of confusing.
Part of the reason for that experience is that DataDog is not OpenTelemetry-native, and all their docs and instructions encourage use of their own agents. Using DataDog with OTel is like trying to touch your nose by reaching your arm over your head.
You should try Otel native observability platforms like SigNoz, Honeycomb, etc. your life will be much simpler
Disclaimer : i am one of the maintainers at SigNoz
The biggest barrier to setting up OTel for me is the development experience. Having a single open specification is fantastic, especially for portability, but the SDKs are almost overwhelmingly abstract and therefore difficult to intuit.
I used to really like Datadog for being a one-stop observability shop and even though the experience of integrating with it is still quite simple, I think product and pricing wise they've jumped the shark.
I'm much happier these days using a collection of small time services and self-hosting other things, and the only part of that which isn't joyful is the boilerplate and not really understanding when and why you should, say, use gRPC over HTTP, and stuff like that.
I still don’t understand what OTEL is. What problem is it solving? If it’s a standard what is the change for the end user? Is it not just a matter of continuing to use whatever (Prometheus, Grafana, etc) with the option to swap components out?
For the tracing part of OTel, neither Prometheus nor Grafana is capable of doing that. Tracing is the most mature part of OTel and the most compelling use case for it. For metrics, we've stayed with Prometheus and AWS CloudWatch Metrics. The metrics part feels very underdeveloped at the moment.
For example the author of the software instruments it with OTel -- either language interface or wire protocol -- and the operator of the software uses the backend of choice.
Otherwise, you have a combinatorial matrix of supported options.
(Naturally, this problem is moot if the author and operator are the same.)
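The decoupling described above is essentially one interface seam: the author instruments against it once, and the operator plugs in a backend. A minimal sketch (the names here are illustrative, not the real SDK surface):

```python
class SpanExporter:
    """The seam a telemetry standard provides: application authors emit
    spans against one interface; operators pick the backend at deploy time."""
    def export(self, span):
        raise NotImplementedError

class InMemoryExporter(SpanExporter):
    """Stand-in for a vendor backend (Jaeger, X-Ray, a SaaS, ...)."""
    def __init__(self):
        self.seen = []
    def export(self, span):
        self.seen.append(span)

def checkout(exporter, order_id):
    # Application code never names a vendor -- only the interface.
    exporter.export({"name": "checkout", "order.id": order_id})
    return "ok"

exp = InMemoryExporter()
checkout(exp, 7)
print(exp.seen)
```

Without the seam, every author-backend pairing needs its own integration, which is the combinatorial matrix the comment refers to; with it, N authors and M backends need N + M integrations instead of N x M.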
I can report the same traces to Jaeger if I want open source, or switch out the provider and have them go to AWS X-Ray (paid), without any code or config changes. Pretty useful. Yes, a tad clumsy to set up the first time.
Adopting OpenTelemetry does not have to be hard for common use-cases. On Kubernetes, the Dash0 operator (https://artifacthub.io/packages/search?repo=dash0-operator) automatically instruments Node.js and Java workloads (and soon other runtimes) with just a custom resource created in a namespace. It works with all OpenTelemetry backends I know of.
Disclaimer: I am one of the authors of the Dash0 operator and work on Dash0 (https://www.dash0.com/), an OpenTelemetry-native observability platform.
I am certainly biased here because OpenTelemetry and Prometheus have been at the core of my professional life for the past half decade, but I think the biggest challenge is that there are many different ways to get to a good setup, and people get lost in the discovery of the available options.
Creating an HTTP endpoint that publishes metrics in a Prometheus-scrapeable format? Easy! Some boolean/float key-value pairs with appropriate annotations (basically: is this a counter or a gauge?), and done! And that led (and leads!) to some very usable Grafana dashboards-created-by-actual-users and therefore much joy.
Then, I read up on how to do things The Proper Way, and was initially very much discouraged, but decided to ignore All that Noise due to the existing solutions working so well. No complaints so far!
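The scrape endpoint described above really is that small. A stdlib-only sketch of the Prometheus text exposition format (metric names are made up for the example):

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

COUNTERS = {"jobs_processed_total": 1234.0}
GAUGES = {"queue_depth": 7.0}

def render_metrics():
    """Prometheus text exposition format: a # TYPE annotation
    (counter or gauge) followed by `name value` lines."""
    lines = []
    for name, value in COUNTERS.items():
        lines.append(f"# TYPE {name} counter")
        lines.append(f"{name} {value}")
    for name, value in GAUGES.items():
        lines.append(f"# TYPE {name} gauge")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"

class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        body = render_metrics().encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.end_headers()
        self.wfile.write(body)

# HTTPServer(("", 9100), MetricsHandler).serve_forever()  # then scrape :9100
print(render_metrics())
```

Point a Prometheus scrape job at the port and you have working dashboards, which is the low-ceremony baseline the comment is contrasting with The Proper Way.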
Glad I'm not the only one that feels this way. For a small application when you just want some metrics and observability, it's a big burden to get it all working.
On my own projects, I send the metrics I care about out through the logs and have another project I run collect and aggregate them from the logs. Probably “wrong” but it works and it's easy to set up.
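The metrics-through-logs approach above can be sketched in a few lines: emit structured log records, then have a side job parse and aggregate them (names are illustrative):

```python
import json

def emit_metric(log, name, value, **tags):
    # Write metrics as structured log lines instead of running a
    # separate metrics pipeline; a collector job aggregates them later.
    log.append(json.dumps({"metric": name, "value": value, **tags}))

def aggregate(log):
    """The 'other project': sum metric values out of the log stream."""
    totals = {}
    for line in log:
        rec = json.loads(line)
        if "metric" in rec:
            totals[rec["metric"]] = totals.get(rec["metric"], 0) + rec["value"]
    return totals

log = []
emit_metric(log, "orders.created", 1, region="eu")
emit_metric(log, "orders.created", 1, region="us")
print(aggregate(log))
```

It trades cardinality control and efficiency for simplicity, which is a perfectly reasonable trade for a small application.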
If you already have reloadable configuration infrastructure, or plan to add it in the future, this is just spreading out your configuration capture. No thank you (and by “no thank you” I mean fuck right off).
If you want to improve your bus number for production triage, you have to make it so anyone (senior) can first identify and then reproduce the configuration and dependencies of the production system locally without interrupting any of the point people to do so. If you cannot see you cannot help.
Just because you’re one of k people who usually discover the problem quickly doesn’t mean you’ll always do it quickly. You have bad days. You have PTO. People release things or flip feature toggles that escape your notice. If you stop to entertain other people’s queries or theories you are guaranteed to be in for a long triage window, and a potential SLA violation. But if you never accept other perspectives then your blind spots can also make for SLA violations.
Let people putter on their own and they can help with the Pareto distributions. Encourage them to do so and you can build your bus number.
I spent altogether too much time trying to get the Rust otel libs working in a useful and concise way. After a few hours I junked it and went back to a direct use of a jaeger client sending off to the otel collector.
there's some gold here, but most of it is over in the consultant/vendor space today, I fear.
I'm literally porting some code to OTel now, and here is where I landed even before this article: it is confusing because it uses vague terminology that means different things in different domains. For example, in one OTel UI "Traces" are the individual HTTP requests to a service. In another UI, against the same data, "Traces" are the log messages from code in the service, and "Requests" are the individual HTTP requests. To wire things up in code, there's yet other terminology.
I haven't decided exactly what to blame for this. In some ways, it's necessary to have vague, inconsistent terminology to cover various use cases. And, to be fair some of the UIs predate OTel.
Interesting. We're trying to cut costs on APM so we've been moving toward opensource alternatives. Setting up OTEL is definitely tedious, especially for traces and DT wasn't making it easier. I've been checking out a few alts, Signoz, Odigos, Chronosphere... there a few others too but these guys stood out. As much as we want to build out OTEl ourselves, looking for a solution to make the transition easy seems like the way to go.
OTEL always seems way too complicated to use to me. Especially if you want to understand what it is doing. The code has a lot of abstractions and indirection (at least in Go).
And reading this it seems a lot of people agree. Hope that can be fixed at some point. Tracing should be simple.
Go with OTel is, unfortunately, known to be challenging ergonomics-wise. The OTel project doesn't really define an ergonomics standard, and leaves it up to the groups for each sub-project (e.g., each of the 11 language groups) to define how they package things up, what convenience wrapper APIs they offer, etc.
In Go, currently it is a deliberate choice to be both very granular and specific, so that end-users can ultimately depend on only the exact packages they need and have nothing standing between them and any customization of the SDK they need to do for their organizations.
The main thing people sometimes struggle with at this point is threading context through all their calls if they haven't done that yet. It's annoying, but you're unfortunately running into a limitation of the Go language here. Most of the other languages (Java, .NET, Ruby, etc.) keep context implicitly available for you at all times because the languages provide that affordance.
So much pain related to context tracking. I'm growing more and more convinced that solving that problem will be the next big thing in PLs, probably in the form of effect systems.
Yeah... this is about how well every OTel migration goes, from what I've seen.
Docs are an absolute monstrosity that rival Bazel's for utility, but are far less complete. Implementations are extremely widely varied in support for basics. Getting X to work with OTel often requires exactly what they did here: reverse-engineering X to figure out where it does something slightly abnormal... which is normal, almost every library does something similar, because it's so hard to push custom data through these systems in a type-safe way, and many decent systems want type safety and will spend a lot of effort to get it.
It feels kinda like OAuth 2 tbh. Lots of promise, obvious desirable goals, but completely failing at everything involving consistent and standardized implementation.
In addition to OTEL, there are many other products, including Odigos, Beyla, Kubeshark, Malcolm, Falco, DDosify, Deepflow, Tetragon, and Retina. Deepflow is a free and open source product.
I literally gave a lightning talk on this in Kubecon NA last year. Here's the youtube video, might help you get some perspective
tl;dr
While there are certainly many areas for the project to improve, some reasons why it could seem complicated:
Extensibility by design: flexibility in defining meters and signals ensures diverse use cases are supported.
It's still a relatively new technology (~3 years old); growing pains are expected. OpenTelemetry is still the most advanced open standard handling all three signals together.
That section reads like trashing on OTel; basically an annoyed rant. Clear debunking of your text exists and is easy to find; for example, both OpenObserve and SigNoz can be trivially self-hosted, and you will not "be charged extraordinary amounts of money" for it. Both take no more than 5 minutes: just set a few env vars, run a single command, and you're done.
I can see the value in smaller software -- I fought for it many times, in fact -- but you will have to do better when making a case for your program. A single semi-informed take reads like a beer-infused dismissal.
I wish I could move off NewRelic. Every time I post about it (seriously, check my post history) over the years, HN commenters try to convince me that it does automated metrics almost as good, or just as good, or even better.
Once in a while I try to spin up OTel like they say. Every single time it sucks. I'll keep trying, though. NewRelic's pricing is so brutal that I hold out hope. Unfortunately, NR's product really is that good...
Have you considered Kamon instead? From personal experience it's really the best tracing solution for Akka and other libraries using Scala Futures. I haven't tried it, but it does have built-in Spring support as well.
I have not downvoted you but you seem to recommend a very specific and tailor-made product for a specific tech stack.
OpenTelemetry is universal. As long as you can send the right network packages to one of a number of ingesting programs, you can have pretty dashboards and a lot of insights, regardless of the programming language of the program that originated the metric / trace / log.
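Concretely, "the right network packages" means speaking OTLP. A minimal sketch of an OTLP/HTTP JSON trace export (default collector port 4318; the field names follow the proto3 JSON mapping of the OTLP trace proto as I understand the spec -- verify against your collector before relying on this):

```python
import json
import os
import time
import urllib.request

def otlp_trace_payload(name, trace_id, span_id):
    """Minimal OTLP/HTTP JSON body carrying a single span."""
    now = time.time_ns()
    return {
        "resourceSpans": [{
            "resource": {"attributes": [
                {"key": "service.name", "value": {"stringValue": "demo"}}
            ]},
            "scopeSpans": [{
                "scope": {"name": "manual"},
                "spans": [{
                    "traceId": trace_id,   # 16 random bytes, hex-encoded
                    "spanId": span_id,     # 8 random bytes, hex-encoded
                    "name": name,
                    "kind": 1,             # SPAN_KIND_INTERNAL
                    "startTimeUnixNano": str(now - 1_000_000),
                    "endTimeUnixNano": str(now),
                }],
            }],
        }]
    }

body = json.dumps(
    otlp_trace_payload("demo-span", os.urandom(16).hex(), os.urandom(8).hex()))
req = urllib.request.Request(
    "http://localhost:4318/v1/traces",  # default OTLP/HTTP traces endpoint
    data=body.encode(), headers={"Content-Type": "application/json"})
# urllib.request.urlopen(req)  # uncomment with a collector listening locally
print(body[:80])
```

Any language that can produce this JSON and POST it can participate, which is what makes the standard language-agnostic.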
Also open-source & self-hostable.
Otel is a design by committee garbage pile of half baked ideas.
I'm still looking for an endpoint just to send simple one-off metrics to from parts of infrastructure that's not scrapable.
This project is really nice for that https://github.com/grafana/docker-otel-lgtm
https://cra.mr/the-problem-with-otel/
But I still dislike OTel every time I have to deal with it.
BugsJustFindMe|1 year ago
baby_souffle|1 year ago
Yes, but only if everything in your stack is supported by their auto instrumentation. Take `aiohttp` for example. The latest version is 3.11.X and ... their auto instrumentation claims to support `3.X` [0] but results vary depending on how new your `aiohttp` is versus the auto instrumentation.
It's _magical_ when it all just works, but that ends up being a pretty narrow needle to thread!
[0]: https://github.com/open-telemetry/opentelemetry-python-contr...
verall|1 year ago
Traces work, and I have the spanmetrics exporter set up, and I can actually see the spanmetrics in prometheus if I query directly, but they won't show up in the jaeger "monitor" tab, no matter what I do.
I spent 3 days on this before my boss is like "why don't we just manually instrument and send everything to the SQL server and create a grafana dashboard from that" and agh I don't want to do that either.
Any advice? It's literally the simplest usecase but I can't get it to work. Should I just add grafana to the pile?
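For reference, a sketch of the Collector wiring this use case usually needs, with the spanmetrics connector feeding a metrics pipeline (exact field names vary by Collector version, so treat this as a starting point rather than a known-good config):

```yaml
receivers:
  otlp:
    protocols:
      grpc:

connectors:
  spanmetrics: {}

exporters:
  otlp/jaeger:
    endpoint: jaeger:4317
    tls:
      insecure: true
  prometheus:
    endpoint: "0.0.0.0:8889"

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlp/jaeger, spanmetrics]
    metrics:
      receivers: [spanmetrics]
      exporters: [prometheus]
```

One hedged note: the Jaeger "Monitor" tab also requires the Jaeger query service itself to be pointed at Prometheus (the `METRICS_STORAGE_TYPE=prometheus` environment variable and a Prometheus server URL, in the versions I've seen); if the metrics show up in Prometheus but not in Jaeger, that link is a likely culprit.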
etimberg|1 year ago
nimish|1 year ago
andy800|1 year ago
paulddraper|1 year ago
If anything I think the backends were kinda slow to adopt.
convolvatron|1 year ago
dboreham|1 year ago
6r17|1 year ago
PeterZaitsev|1 year ago
jensensbutton|1 year ago
cglan|1 year ago
SomaticPirate|1 year ago
OTel is a bear though. I think the biggest advantage it gives you is the ability to move across tracing providers
pranay01|1 year ago
You should try OTel-native observability platforms like SigNoz, Honeycomb, etc. Your life will be much simpler.
Disclaimer : i am one of the maintainers at SigNoz
ljm|1 year ago
I used to really like Datadog for being a one-stop observability shop and even though the experience of integrating with it is still quite simple, I think product and pricing wise they've jumped the shark.
I'm much happier these days using a collection of small time services and self-hosting other things, and the only part of that which isn't joyful is the boilerplate and not really understanding when and why you should, say, use gRPC over HTTP, and stuff like that.
cedws|1 year ago
hangonhn|1 year ago
paulddraper|1 year ago
For example the author of the software instruments it with OTel -- either language interface or wire protocol -- and the operator of the software uses the backend of choice.
Otherwise, you have a combinatorial matrix of supported options.
(Naturally, this problem is moot if the author and operator are the same.)
dionian|1 year ago
mmanciop|1 year ago
Disclaimer: I am one of the authors of the Dash0 operator and work on Dash0 (https://www.dash0.com/), an OpenTelemetry-native observability platform.
Automatic instrumentation on Kubernetes is also provided by the community OpenTelemetry (https://github.com/open-telemetry/opentelemetry-operator).
I am certainly biased here because OpenTelemetry and Prometheus have been at the core of my professional life for the past half decade, but I think the biggest challenge is that there are many different ways to get you to a good setup, and people get lost in the discovery of the available options.
antithesis-nl|1 year ago
Creating an HTTP endpoint that publishes metrics in a Prometheus-scrapeable format? Easy! Some boolean/float key-value pairs with appropriate annotations (basically: is this a counter or a gauge?), and done! And that led (and leads!) to some very usable Grafana dashboards-created-by-actual-users and therefore much joy.
Then, I read up on how to do things The Proper Way, and was initially very much discouraged, but decided to ignore All that Noise due to the existing solutions working so well. No complaints so far!
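The "easy" path described above really is small. Here is a stdlib-only sketch of such an endpoint: a toy registry rendered in the Prometheus text exposition format and served at `/metrics` (a real setup would more likely use the official `prometheus_client` library, and the metric names here are made up).

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Toy metric registry: name -> (type, help, value).
METRICS = {
    "jobs_processed_total": ("counter", "Jobs processed since start", 1234),
    "queue_depth": ("gauge", "Current queue depth", 7),
}

def render_prometheus(metrics: dict) -> str:
    """Render metrics in the Prometheus text exposition format."""
    lines = []
    for name, (mtype, help_text, value) in metrics.items():
        lines.append(f"# HELP {name} {help_text}")
        lines.append(f"# TYPE {name} {mtype}")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"

class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_prometheus(METRICS).encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

# To serve: HTTPServer(("", 8000), MetricsHandler).serve_forever()
```

Point a Prometheus scrape job at the port and you have working dashboards with no SDK, no collector, and no pipeline config.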
ejs|1 year ago
On my own projects, I send the metrics I care about out through the logs and have another project I run collect and aggregate them from the logs. Probably “wrong” but it works and it's easy to set up.
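The logs-as-metrics approach above is easy to sketch. Assuming a hypothetical log convention like `METRIC <name> <value>`, the collector side is just a regex and a running sum:

```python
import re
from collections import defaultdict

# Hypothetical convention: app code logs lines like "METRIC requests 1".
METRIC_RE = re.compile(r"METRIC (\w+) ([0-9.]+)")

def aggregate(log_lines):
    """Sum counter-style metric values found in log output."""
    totals = defaultdict(float)
    for line in log_lines:
        m = METRIC_RE.search(line)
        if m:
            totals[m.group(1)] += float(m.group(2))
    return dict(totals)
```

It conflates everything into counters and loses histogram semantics, but for a personal project that trade-off is often fine.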
lexh|1 year ago
Turtles all the way down.
hinkley|1 year ago
If you already have reloadable configuration infrastructure, or plan to add it in the future, this is just spreading out your configuration capture. No thank you (and by “no thank you” I mean fuck right off).
If you want to improve your bus number for production triage, you have to make it so anyone (senior) can first identify and then reproduce the configuration and dependencies of the production system locally without interrupting any of the point people to do so. If you cannot see you cannot help.
Just because you’re one of k people who usually discover the problem quickly doesn’t mean you’ll always do it quickly. You have bad days. You have PTO. People release things or flip feature toggles that escape your notice. If you stop to entertain other people’s queries or theories you are guaranteed to be in for a long triage window, and a potential SLA violation. But if you never accept other perspectives then your blind spots can also make for SLA violations.
Let people putter on their own and they can help with the Pareto distributions. Encourage them to do so and you can build your bus number.
pnathan|1 year ago
there's some gold here, but most of it is over in the consultant/vendor space today, I fear.
shireboy|1 year ago
I haven't decided exactly what to blame for this. In some ways, it's necessary to have vague, inconsistent terminology to cover various use cases. And, to be fair some of the UIs predate OTel.
Alex-Programs|1 year ago
Loki works great.
vzbl9293|1 year ago
pranay01|1 year ago
For others checking out this thread, here's our github repo - https://github.com/signoz/signoz
PS: I am one of the maintainers at SigNoz
Cwizard|1 year ago
And reading this it seems a lot of people agree. Hope that can be fixed at some point. Tracing should be simple.
See for example this project: https://github.com/jmorrell/minimal-nodejs-otel-tracer
I think it is more a POC but it shows that all this complexity is not needed IMO.
phillipcarter|1 year ago
In Go, currently it is a deliberate choice to be both very granular and specific, so that end-users can ultimately depend on only the exact packages they need and have nothing standing between them and any customization of the SDK they need to do for their organizations.
There's some ways to isolate this kind of setup, which we document like so: https://opentelemetry.io/docs/languages/go/getting-started/#...
Stuff that into an otel.go file and then the rest of your code is usually pretty okay. From there your application code usually looks like this:
https://gist.github.com/cartermp/f37b6702109bbd7401be8a1cab8...
The main thing people sometimes struggle with at this point is threading context through all their calls if they haven't done that yet. It's annoying, but unfortunately running into a limitation of the Go language here. Most of the other languages (Java, .NET, Ruby, etc.) keep context implicitly available for you at all times because the languages provide that affordance.
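For contrast, here is a sketch of the implicit-context affordance the comment describes, using Python's `contextvars` (which, to my understanding, is what OTel's Python SDK builds its current-span tracking on). The "span" is just a string here; the point is that nothing threads a context argument through the call chain, and concurrent tasks still see their own value.

```python
import asyncio
from contextvars import ContextVar

# Implicit "current span" -- the affordance Go lacks.
current_span: ContextVar[str] = ContextVar("current_span", default="root")

async def inner() -> str:
    # No span parameter anywhere in the signature; the context
    # variable travels with the task automatically.
    return current_span.get()

async def handler(name: str) -> str:
    token = current_span.set(name)
    try:
        return await inner()
    finally:
        current_span.reset(token)

async def main():
    # Two concurrent "requests" each observe their own span, because
    # each asyncio task gets a copy of the context at creation time.
    return await asyncio.gather(handler("span-a"), handler("span-b"))
```

In Go the equivalent requires passing `ctx context.Context` explicitly through every call, which is exactly the threading chore the parent comment mentions.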
andrewflnr|1 year ago
etimberg|1 year ago
Groxx|1 year ago
Docs are an absolute monstrosity that rival Bazel's for utility, but are far less complete. Implementations are extremely widely varied in support for basics. Getting X to work with OTel often requires exactly what they did here: reverse-engineering X to figure out where it does something slightly abnormal... which is normal, almost every library does something similar, because it's so hard to push custom data through these systems in a type-safe way, and many decent systems want type safety and will spend a lot of effort to get it.
It feels kinda like OAuth 2 tbh. Lots of promise, obvious desirable goals, but completely failing at everything involving consistent and standardized implementation.
gpi|1 year ago
vednig|1 year ago
almaight|1 year ago
pranay01|1 year ago
tl;dr
while there are certainly many areas to improve for the project, some reasons why it could seem complicated
Extensibility by Design: Flexibility in defining meters and signals ensures diverse use cases are supported.
It's still a relatively new technology (~3 years old), so growing pains are expected. OpenTelemetry is still the most advanced open standard handling all three signals together.
[1] https://www.youtube.com/watch?v=xEu8_Aeo_-o
jensensbutton|1 year ago
anacrolix|1 year ago
pdimitar|1 year ago
I can see the value in smaller software -- I fought for it many times, in fact -- but you will have to do better when making a case for your program. Just giving one semi-informed dismissive take reads like a beer-infused dismissal.
icelancer|1 year ago
Once in a while I try to spin up OTel like they say. Every single time it sucks. I'll keep trying, though. NewRelic's pricing is so brutal that I hold out hope. Unfortunately, NR's product really is that good...
pdimitar|1 year ago
unknown|1 year ago
[deleted]
hocuspocus|1 year ago
https://kamon.io
Edit: I wonder why suggesting JVM instrumentation that is much more polished than the OTel and Lightbend agents gets me downvoted?
pdimitar|1 year ago
OpenTelemetry is universal. As long as you can send the right network packets to one of a number of ingesting programs, you can have pretty dashboards and a lot of insights, regardless of the programming language of the program that originated the metric / trace / log.
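The "right network packets" really are just structured payloads. As a sketch, this builds a minimal OTLP/HTTP JSON body for a single gauge point and posts it to a collector's default OTLP/HTTP port; the field names follow the OTLP JSON encoding as I understand it, but verify against the spec and your collector version before relying on this exact shape.

```python
import json
import time
import urllib.request

def otlp_gauge_payload(name: str, value: float, service: str) -> dict:
    """Build a minimal OTLP/HTTP JSON body for one gauge data point."""
    return {
        "resourceMetrics": [{
            "resource": {"attributes": [{
                "key": "service.name",
                "value": {"stringValue": service},
            }]},
            "scopeMetrics": [{"metrics": [{
                "name": name,
                "gauge": {"dataPoints": [{
                    "asDouble": value,
                    # OTLP JSON encodes 64-bit timestamps as strings.
                    "timeUnixNano": str(time.time_ns()),
                }]},
            }]}],
        }]
    }

def send(payload: dict, endpoint: str = "http://localhost:4318/v1/metrics"):
    """POST the payload to an OTLP/HTTP metrics endpoint."""
    req = urllib.request.Request(
        endpoint,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    return urllib.request.urlopen(req)
```

Any language that can produce this JSON and make an HTTP POST can feed the same backend, which is the universality being claimed.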
linkerdoo|1 year ago
[deleted]