Migrating to OpenTelemetry

CSMastermind|2 years ago

> The data collected from these streams is sent to several vendors including Datadog (for application logs and metrics), Honeycomb (for traces), and Google Cloud Logging (for infrastructure logs).

It sounds like they were in a place that a lot of companies are in where they don't have a single pane of glass for observability. One of if not the main benefit I've gotten out of Datadog is having everything in Datadog so that it's all connected and I can easily jump from a trace to logs for instance.

One of the terrible mistakes I see companies make with this tooling is fragmenting like this. Everyone has their own personal preference for tool and ultimately the collective experience is significantly worse than the sum of its parts.

badloginagain|2 years ago

I feel we hold up single-observability-solution as the Holy Grail, and I can see the argument for it- one place to understand the health of your services.

But I've also been in terrible vendor lock-in situations, being bent over the barrel because switching to a better solution is so damn expensive.

At least now with OTel you have an open standard that allows you to switch easier, but even then I'd rather have 2 solutions that meet my exact observability requirements than a single solution that does everything OKish.

dexterdog|2 years ago

Depending on your usage it can be prohibitively expensive to use datadog for everything like that. We have it for just our prod env because it's just not worth what it brings to the table to put all of our logs into it.

maccard|2 years ago

I've spent a small amount of time in datadog, lots in grafana, and somewhere in between in honeycomb. Out applications are designed to emit traces, and comparing honeycomb with tracing to a traditional app with metrics and logs, I would choose tracing every time.

It annoys me that logs are overlooked in honeycomb, (and metrics are... fine). But, given the choice between a single pane of glass in grafana or having to do logs (and metrics sometimes) in cloudwatch but spending 95% of my time in honeycomb - I'd pick honeycomb every time

devin|2 years ago

Eh, personally I view honeycomb and datadog as different enough offerings that I can see why you'd choose to have both.

rewmie|2 years ago

> It sounds like they were in a place that a lot of companies are in where they don't have a single pane of glass for observability.

One of the biggest features of AWS which is very easy to take for granted and go unnoticed is Amazon CloudWatch. It supports metrics, logging, alarms, metrics from alarms, alarms from alarms, querying historical logs, trigger actions, etc etc etc. and it covers each and every single service provided by AWS including metaservices like AWS Config and Cloudtrail.

And you barely notice it. It's just there, and you can see everything.

> One of the terrible mistakes I see companies make with this tooling is fragmenting like this.

So much this. It's not fun at all to have to go through logs and metrics on any application,and much less so if for some reason their maintainers scattered their metrics emission to the four winds. However, with AWS all roads lead to Cloudwatch, and everything is so much better.

tapoxi|2 years ago

I made this switch very recently. For our Java apps it was as simple as loading the otel agent in place of the Datadog SDK, basically "-javaagent:/opt/otel/opentelemetry-javaagent.jar" in our args.

The collector (which processes and ships metrics) can be installed in K8S through Helm or an operator, and we just added a variable to our charts so the agent can be pointed at the collector. The collector speaks OTLP which is the fancy combined metrics/traces/logs protocol the OTEL SDKs/agents use, but it also speaks Prometheus, Zipkin, etc to give you an easy migration path. We currently ship to Datadog as well as an internal service, with the end goal being migrating off of Datadog gradually.

andrewstuart2|2 years ago

We tried this about a year and a half ago and ended up going somewhat backwards into DD entrenchment, because they've decided that anything not an official DD metric (that is, collected by their agent typically) is custom and then becomes substantially more expensive. We wanted a nice migration path from any vendor to any other vendor but they have a fairly effective strategy for making gradual migrations more expensive for heavy telemetry users. At least our instrumentation these days is otel, but it's the metrics we expected to just scrape from prometheus that we had to dial back and start using more official DD agent metrics and configs to get, lest our bill balloon by 10x. It's a frustrating place to be. Especially since it's still not remotely cheap, just that it could be way worse.

I know this isn't a DataDog post, and I'm a bit off topic, but I try to do my best to warn against DD these days.

MajimasEyepatch|2 years ago

It's interesting that you're using both Honeycomb and Datadog. With everything migrated to OTel, would there be advantages to consolidating on just Honeycomb (or Datadog)? Have you found they're useful for different things, or is there enough overlap that you could use just one or the other?

bhyolken|2 years ago

Author here, thanks for the question! The current split developed from the personal preferences of the engineers who initially set up our observability systems, based on what they had used (and liked) at previous jobs.

We're definitely open to doing more consolidation in the future, especially if we can save money by doing that, but from a usability standpoint we've been pretty happy with Honeycomb for traces and Datadog for everything else so far. And, that seems to be aligned with what each vendor is best at at the moment.

Jedd|2 years ago

The killer feature of OpenTelemetry for us is brokering (with ETL).

Partly this lets us easily re-route & duplicate telemetry, partly it means changes to backend products in the future won't be a big disruption.

For metrics we're a mostly telegraf->prometheus->grafana mimir shop - telegraf because its rock solid and feature-rich, prometheus because there's no real competition in that tier, and mimir because of scale & self-host options.

Our scale problem means most online pricing calculators generate overflow errors.

Our non-security log destination preference is Loki - for similar reasons to Mimir - though a SIEM it definitely is not.

Tracing to a vendor, but looking to bring that back to grafana Tempo. Product maturity is a long way off commercial APM offerings, but it feels like the feature-set is about 70% there and converging rapidly. Off-the-shelf tracing products have an appealingly low cost of entry, which only briefly defers lock-in & pricing shocks.

pranay01|2 years ago

Yeah, the ability to send to multiple sources is quite powerful and most of this comes from the configurability of Otel Collector [1].

If you are looking for a open source backend for OpenTelemetry, then you can explore SigNoz[2] (I am one of the founders) We have a quite a decent product for APM/tracing leveraging opentelemerty native data format and semantic convention.

[1]https://opentelemetry.io/docs/collector/ [2]https://github.com/SigNoz/signoz

hagen1778|2 years ago

> mimir because of scale & self-host options

Have you looked at VictoriaMetrics [0] before opting for Mimir?

[0] https://victoriametrics.com/blog/mimir-benchmark/

nevon|2 years ago

I would love to save a few hundred thousands a year by running Otel collector over Datadog agents, just on the cost-per-host alone. Unfortunately that would also mean giving up Datatog APM and NPM, as far as I can tell, which have been really valuable. Going back to just metrics and traces would feel like quite the step backwards and be a hard sell.

arccy|2 years ago

you can submit opentelemetry traces to datadog which should be the equivalent of apm/npm, though maybe with a less polished integration.

nullify88|2 years ago

One thing that's slightly off putting about OpenTelemetry is how resource attributes don't get included as prometheus labels for metrics, instead they are on an info metric which requires a join to enrich the metric you are interested in.

Luckily the prometheus exporters have a switch to enable this behaviour, but there's talk of removing this functionality because it breaks the spec. If you were to use the OpenTelemetry protocol in to something like Mimir, you don't have the option of enabling that behaviour unless you use prometheus remote write.

Our developers aren't a fan of that.

https://opentelemetry.io/docs/specs/otel/compatibility/prome...

valyala|2 years ago

FYI, VictoriaMetrics converts resource attributes to ordinary labels before storing metrics received via OoenTelemetry protocol - https://docs.victoriametrics.com/#sending-data-via-opentelem... . This simplifies filtering and grouping of such metrics during querying. For example, you need to write `my_metric{resource_name="foo"}` instead of `my_metric * on(resource_id) group_left() resource_info{resource_name="foo"}` when filtering by `resource_name`.

ronyaurora|2 years ago

If you are using the prometheus exporter, you can use the transform processor to get specific resource attributes into metric labels.

With the advantage that you get only the specific attributes you want, thus avoiding a cardinality explosion.

https://github.com/open-telemetry/opentelemetry-collector-co...

roskilli|2 years ago

> Moreover, we encountered some rough edges in the metrics-related functionality of the Go SDK referenced above. Ultimately, we had to write a conversion layer on top of the OTel metrics API that allowed for simple, Prometheus-like counters, gauges, and histograms.

Have encountered this a lot from teams attempting to use the metrics SDK.

Are you open to comment on specifics here and also what kind of shim you had to put in front of the SDK? It would be great to continue to retrieve feedback so that we can as a community have a good idea of what remains before it's possible to use the SDK for real world production use cases in anger. Just wiring up the setup in your app used to be fairly painful but that has gotten somewhat better over the last 12-24 months, I'd love to also hear what is currently causing compatibility issues w/ the metric types themselves using the SDK which requires a shim and what the shim is doing to achieve compatibility.

bhyolken|2 years ago

Sure, happy to provide more specifics!

Our main issue was the lack of a synchronous gauge. The officially supported asynchronous API of registering a callback function to report a gauge metric is very different from how we were doing things before, and would have required lots of refactoring of our code. Instead, we wrote a wrapper that exposes a synchronous-like API: https://gist.github.com/yolken-airplane/027867b753840f7d15d6....

It seems like this is a common feature request across many of the SDKs, and it's in the process of being fixed in some of them (https://github.com/open-telemetry/opentelemetry-specificatio...)? I'm not sure what the plans are for the golang SDK specifically.

Another, more minor issue, is the lack of support for "constant" attributes that are applied to all observations of a metric. We use these to identify the app, among other use cases, so we added wrappers around the various "Add", "Record", "Observe", etc. calls that automatically add these. (It's totally possible that this is supported and I missed it, in which case please let me know.)

Overall, the SDK was generally well-written and well-documented, we just needed some extra work to make the interfaces more similar to the ones we were using before.

caust1c|2 years ago

Curious about the code implemented for logs! Hopefully that's something that can be shared at some point. Also curious if it integrates with `log/slog` :-)

Congrats too! As I understand it from stories I've heard from others, migrating to OTel is no easy undertaking.

bhyolken|2 years ago

Thanks! For logs, we actually use github.com/segmentio/events and just implemented a handler for that library that batches logs and periodically flushes them out to our collector using the underlying protocol buffer interface. We plan on migrating to log/slog soon, and once we do that we'll adapt our handler and can share the code.

throwaway084t95|2 years ago

What is the "first principles" argument that observability decomposes into logs, metrics, and tracing? I see this dogma accepted everywhere, but I'm inquisitive about it

yannyu|2 years ago

First you had logs. Everyone uses logs because it's easy. Logs are great, but suddenly you're spending a crapton of time or money maintaining terabytes or petabytes of storage and ingest of logs. And even worse, in some cases for these logs, you don't actually care about 99% of the log line and simply want a single number, such as CPU utilization or the value of the shopping cart or latency.

So, someone says, "let's make something smaller and more portable than logs. We need to track numerical data over time more easily, so that we can see pretty charts of when these values are outside of where they should be." This ends up being metrics and a time-series database (TSDB), built to handle not arbitrary lines of text but instead meant to parse out metadata and append numerical data to existing time-series based on that metadata.

Between metrics and logs, you end up with a good idea of what's going on with your infrastructure, but logs are still too verbose to understand what's happening with your applications past a certain point. If you have an application crashing repeatedly, or if you've got applications running slowly, metrics and logs can't really help you there. So companies built out Application Performance Monitoring, meant to tap directly into the processes running on the box and spit out all sorts of interesting runtime metrics and events about not just the applications, but the specific methods and calls those applications are utilizing within their stack/code.

Initially, this works great if you're running these APM tools on a single box within monolithic stacks, but as the world moved toward Cloud Service Providers and containerized/ephemeral infrastructure, APM stopped being as effective. When a transaction starts to go through multiple machines and microservices, APM deployed on those boxes individually can't give you the context of how these disparate calls relate to a holistic transaction.

So someone says, "hey, what if we include transaction IDs in these service calls, so that we can post-hoc stitch together these individual transaction lines into a whole transaction, end-to-end?" Which is how you end up with the concept of spans and traces, taking what worked well with Application Performance Monitoring and generalizing that out into the modern microservices architectures that are more common today.

tsamba|2 years ago

Interesting read. What did you find easier about using GCP's log tooling for your internal system logs, rather than the OTel collector?

bhyolken|2 years ago

Author here. This decision was more about ease of implementation than anything else. Our internal application logs were already being scooped up by GCP because we run our services in GKE, and we already had a GCP->Datadog log syncer [1] for some other GCP infra logs, so re-using the GCP-based pipeline was the easiest way to handle our application logs once we removed the Datadog agent.

In the future, we'll probably switch these logs to also go through our collector, and it shouldn't be super hard (because we already implemented a golang OTel log handler for the external case), but we just haven't gotten around to it yet.

[1] https://docs.datadoghq.com/integrations/google_cloud_platfor...

clintonb|2 years ago

Their collector is used to send infrastructure logs to GCP (instead of Datadog).

My guess is this is to save on costs. GCP logging is probably cheaper than Datadog, and infrastructure logs may not be needed as frequently as application logs.

shoelessone|2 years ago

I really really want to use OTel for a small project but have always had a really tough time finding a path that is cheap or free for a personal project.

In theory you can send telemetry data with OTel to Cloud Watch, but I've struggle to connect the dots with the front end application (e.g. React/Next.js).

yourapostasy|2 years ago

Have you checked out Jaeger [1]? It is lightweight enough for a personal project, open source, and featureful enough to really help "turn on the lightbulb" with other engineers to show them the difference between logging/monitoring and tracing.

[1] https://www.jaegertracing.io/

arccy|2 years ago

grafana cloud, honeycomb, etc have free tiers, though you'll have to watch how much data you send them. or you can self host something like signoz or the elastic stack. frontend will typically go to an instance of opentelemetry collector to filter/convert to the protocol for the storage backend.

jon-wood|2 years ago

At the risk of being downvoted (probably justly) for having a moan, can we please have a moratorium on every blog post needing to have a generally irrelevant picture attached to it? On opening this page I can see 28 words that are actually relevant because almost the entire view is consumed by a huge picture of a graph and the padding around it.

This is endemic now. Doesn't matter what someone is writing about there'll be some pointless stock photo taking up half the page. There'll probably be some more throughout the page. Stop it please.

k__|2 years ago

I had the impression, logs and metrics are a pre-observability thing.

SteveNuts|2 years ago

I've never heard the term "pre-observability", what does that mean?

marcosdumay|2 years ago

Observability is about logs and metrics, and pre-observability (I guess you mean the high-level-only records simpler environments keep) is also about logs and metrics.

Anything you register to keep track of your environment has the form of either logs or metrics. The difference is about the contents of such logs and metrics.

75 comments