Launch HN: Odigos (YC W23) – Instant distributed tracing for Kubernetes clusters
Our earlier experiences with monitoring tools were frustrating. Monitoring a distributed system with multiple microservices, we found ourselves spending far too much time trying to locate the specific microservice at the root of a problem. For example, we once spent hours debugging an application we suspected was causing high latency, only to find that the actual problem was rooted in a completely different application.
Then we learned about distributed tracing, which solves exactly this problem. Unlike metrics or logs, which capture a data point at a single moment in a single application, a distributed trace follows a request as it propagates through a distributed environment by tagging it with a unique ID. This allows developers to understand the context of each request and how their distributed applications actually behave.
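In practice, that unique ID travels between services in request metadata. With OpenTelemetry (which Odigos builds on), that's the W3C Trace Context header; a representative value looks like this (the IDs below are illustrative):

    traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01

The four dash-separated fields are the format version, the 16-byte trace ID shared by every span in the request, the 8-byte ID of the parent span, and trace flags (e.g. whether the request is sampled).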
The downside is that it is difficult to implement. Unlike metrics or logs, distributed tracing only delivers value once it is implemented across multiple applications. If even one of your applications does not produce trace data, context propagation is broken and the value of the traces drops significantly.
We manually implemented distributed tracing for multiple companies, but coordinating all the development teams to instrument their applications for a complete distributed trace was a challenge. Once the implementation was finished, we saw great value and fixed production issues much faster. But a partial implementation wasn't worth much.
We set out to automate this process. We knew how to do most of it, but the trickiest part was automatically instrumenting programs written in compiled languages (like Go). If we could do that, we would be able to automate the entire process of generating distributed traces. While researching, we realized that eBPF, a technology that allows the Linux kernel to load external programs for execution within the kernel, could be used to develop automatic instrumentation for compiled languages. That was the final piece of the puzzle, and with it we were able to develop Odigos.
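To give a feel for the mechanism, here is a minimal sketch using the open-source cilium/ebpf library (not our production code; the object file, symbol name, and PID are illustrative) of attaching a uprobe to a function inside a running Go binary:

    package main

    import (
        "log"

        "github.com/cilium/ebpf"
        "github.com/cilium/ebpf/link"
    )

    func main() {
        // Load a pre-compiled eBPF object file (assumed to contain a
        // program named "trace_http"). Requires root privileges.
        coll, err := ebpf.LoadCollection("probe.o")
        if err != nil {
            log.Fatalf("loading eBPF collection: %v", err)
        }
        defer coll.Close()

        // Open the target process's binary so we can resolve symbols in it.
        ex, err := link.OpenExecutable("/proc/1234/exe")
        if err != nil {
            log.Fatalf("opening executable: %v", err)
        }

        // Attach the probe at the entry of the HTTP dispatch function;
        // the kernel now runs it on every call, with no app changes.
        up, err := ex.Uprobe("net/http.(*ServeMux).ServeHTTP",
            coll.Programs["trace_http"], nil)
        if err != nil {
            log.Fatalf("attaching uprobe: %v", err)
        }
        defer up.Close()

        select {} // keep the probe attached while events stream
    }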
Odigos first scans and recognizes all your running applications, then detects the programming language of each one and auto-instruments it accordingly, using eBPF and OpenTelemetry. In addition, it deploys collectors that buffer, filter, and deliver data to your chosen monitoring tool, and auto-scales them according to the amount of traffic. This automation lets developers enjoy distributed traces within minutes, as opposed to manual instrumentation, which can take months.
Automatic instrumentation across programming languages is not a trivial task, especially when dealing with static binaries (like the ones produced by the Go compiler). We built multiple mechanisms to make sure we inject the relevant headers in a secure and stable way. We developed a system that tracks functions and structs across different versions of open-source libraries. In addition, we developed a system that performs userspace memory management in eBPF. As a result, Odigos is the only solution that is able to automatically generate distributed traces for compiled languages like Go and Rust. While other solutions require users to be experts in OpenTelemetry or eBPF, our solution does not require prior knowledge of observability technologies.
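As a toy illustration of the version-tracking idea (the names and numbers below are invented for the example, not our actual offset data), the probes consult a table mapping each library version to the struct-field offsets they need to read:

    package main

    import "fmt"

    // fieldOffsets maps "library@version" to the byte offset of each
    // struct field a probe reads; offsets shift between releases, so
    // the table is keyed by version.
    var fieldOffsets = map[string]map[string]uint64{
        "net/http@go1.19": {"Request.Method": 0, "Request.URL": 16},
        "net/http@go1.20": {"Request.Method": 0, "Request.URL": 16},
    }

    func offsetFor(libVersion, field string) (uint64, bool) {
        fields, ok := fieldOffsets[libVersion]
        if !ok {
            return 0, false
        }
        off, ok := fields[field]
        return off, ok
    }

    func main() {
        if off, ok := offsetFor("net/http@go1.20", "Request.URL"); ok {
            fmt.Printf("read Request.URL at offset %d\n", off)
        }
    }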
Our solution can be installed on any Kubernetes cluster by executing a single command. Once installed, we detect the programming language of every running application and apply the relevant instrumentation. For JIT languages (Java and .NET) or interpreted languages (JavaScript and Python) we deploy OpenTelemetry instrumentation. For compiled languages (Go, Rust, C) we deploy our eBPF-based instrumentation. All of this is abstracted from the user, who only has to: (1) select any or all of their target applications and (2) select a backend to send monitoring data to.
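For reference, the flow looks roughly like this (assuming the odigos CLI from our docs is installed; exact commands may change between releases):

    odigos install   # deploys Odigos into the current kubectl context
    odigos ui        # opens the local UI to pick apps and a destination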
In May 2022, we released our first open-source project: automatic instrumentation for Go applications, based on eBPF. We later donated this project to the OpenTelemetry community and it is currently being developed as part of the Go Automatic Instrumentation SIG.
We are big believers in open standards; therefore, the instrumentation and collectors used by Odigos are all based on open-source projects developed by the OpenTelemetry community. This also enables us to be vendor-agnostic.
Currently we are focused on building our open-source project. There are no pricing or paid features yet, but in the future, we are planning to offer a managed version of Odigos that will include enterprise features.
If you're interested in learning more, check out our docs (https://docs.odigos.io), watch a demo video (https://www.youtube.com/watch?v=9d36AmVtuGU), and visit our website (https://odigos.io).
We’d love to hear your experiences with tracing and monitoring distributed applications and anything else you’d like to share!
cube2222 | 3 years ago
> Automatic instrumentation across programming languages is not a trivial task, especially when dealing with static binaries (like the ones produced by the Go compiler). We built multiple mechanisms to make sure we inject the relevant headers in a secure and stable way. We developed a system that tracks functions and structs across different versions of open-source libraries.
Could be very useful for non-greenfield projects. I'd love to learn more about the details; is there a writeup somewhere?
Though I'd still recommend new projects do "proper" tracing, with not only one span per service but also spans for important functions and additional application-specific tags, as that is easily 10x the value.
But since life is a sequence of tradeoffs, I think this project could be really useful in a lot of places.
phillipcarter | 3 years ago
FWIW Odigos makes this possible because it uses OpenTelemetry (and generates OTel-compatible instrumentation for the eBPF-sourced data). You can go into an app that's instrumented this way, add an OpenTelemetry SDK, and start writing manual instrumentation or include additional instrumentation libraries. Your traces will just get deeper/richer when you do that.
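For example, in Go a manual span added on top of the automatic ones looks roughly like this (a minimal sketch; the tracer name, span name, and attribute are illustrative, and a real app would also register a TracerProvider so spans get exported):

    package main

    import (
        "context"

        "go.opentelemetry.io/otel"
        "go.opentelemetry.io/otel/attribute"
    )

    func processOrder(ctx context.Context, orderID string) {
        // Starts a child span of whatever span is already in ctx,
        // so it nests under the automatically generated trace.
        ctx, span := otel.Tracer("checkout").Start(ctx, "processOrder")
        defer span.End()

        span.SetAttributes(attribute.String("order.id", orderID))
        _ = ctx // pass ctx onward so further child spans nest correctly
    }

    func main() {
        processOrder(context.Background(), "ord-123")
    }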
edenfed | 3 years ago
In addition, we automatically create spans for popular open-source libraries in use, so you should also expect to see spans for database connections / cloud SDKs / Kafka clients / etc. I definitely agree that manual instrumentation is very important in addition to the automatic kind.
intelVISA | 3 years ago
Still, always glad to see some innovation in this space.
mdaniel | 3 years ago
For an unrelated reason, today I was reminded about Pixie (https://news.ycombinator.com/item?id=25375170 and https://news.ycombinator.com/item?id=31687978 and https://github.com/pixie-io/pixie#readme), which describes itself as an eBPF Kubernetes observability tool, and is also Apache-licensed.
I suspect the difference may be your aspiration to move beyond just Kubernetes, but I wondered if that's the biggest difference between your project and theirs? Or maybe C++ versus Go?
thorgaardian | 3 years ago
I was digging through the docs and it looks like you have custom language detection. Did you consider extracting the language-detection features from buildpacks to do this? I imagine you'd get more reliable results and have less to maintain if you used that as the basis.
yashap | 3 years ago
I'd imagine the challenge here is the long tail of tracing and metric needs. I'm thinking things like:
- For the JVM, do you support things like thread pools and execution contexts well? E.g. if part of serving a response to an HTTP request means executing some async work against an execution context, does the context propagation work properly? And if so, would this work for other JVM languages, like Scala, or just Java? When I've manually instrumented apps for context propagation, it's been easy for languages like JS (Node) and PHP, but hard for languages like Scala, where people use so many different concurrency models.
- Some units of work/tracing are pretty standardized, like serving a response to an HTTP request. But others are less so, for example work triggered by job queues/events, where a message on some sort of Kafka/Redis/Postgres/whatever queue triggers your app to do some work (instead of an HTTP request). I have trouble seeing how Odigos would instrument this well. E.g. even if you detect the work, how do you label related metrics well (you can't just rely on HTTP method/path)? How do you measure success/failure of the job? Or if you don't try to tackle this sort of use case, would there be something like Odigos libs for manual instrumentation where necessary? (For context, a sketch of the manual version follows below.)
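When I've done this manually with OpenTelemetry, it meant shuttling the trace context through the message headers myself, roughly like this Go sketch (names are illustrative):

    package main

    import (
        "context"

        "go.opentelemetry.io/otel"
        "go.opentelemetry.io/otel/propagation"
    )

    // Producer: serialize the current trace context into headers that
    // ride along with the queue message (e.g. Kafka record headers).
    func injectHeaders(ctx context.Context) map[string]string {
        carrier := propagation.MapCarrier{}
        otel.GetTextMapPropagator().Inject(ctx, carrier)
        return carrier
    }

    // Consumer: restore the context from the headers and start a span
    // for the job, so it joins the original trace instead of a new one.
    func handleJob(headers map[string]string) {
        ctx := otel.GetTextMapPropagator().Extract(
            context.Background(), propagation.MapCarrier(headers))
        ctx, span := otel.Tracer("worker").Start(ctx, "process-job")
        defer span.End()
        _ = ctx // ... do the work ...
    }

    func main() {
        // Register the W3C propagator; the global default is a no-op.
        otel.SetTextMapPropagator(propagation.TraceContext{})
        handleJob(injectHeaders(context.Background()))
    }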
jedberg | 3 years ago
Doing it automatically is a huge win!
Congrats on the launch and I look forward to learning more!
jzelinskie | 3 years ago
I see you're injecting trace IDs into programs. How do you guarantee that this doesn't break the binary or flag any security/compliance requirements?
phillipcarter | 3 years ago
Agreed! I'm one of the maintainers of part of the project. What sorts of things are top of mind for you w.r.t. quality-of-life improvements?
Benjamin_Dobell | 3 years ago
If you can go beyond Kubernetes, I think that'd give Odigos more staying power. Naturally, some integrations are out of your hands, AWS Fargate being one (https://github.com/aws/containers-roadmap/issues/1027). However, if you could get integrations up and running with the likes of Fargate, Fly.io, Render.com, etc., that'd be amazing.
avinassh | 3 years ago
> A Datadog account with API key. Go to Datadog website to create a new free account. In addition, create a new API key by navigating to Organization settings, then click on API keys, and create a new key.
https://docs.odigos.io/prerequisites
tecleandor | 3 years ago
If I understood correctly, Odigos supports a bunch of observability backends, so instead of Datadog you could use Jaeger, Splunk, or OpenTelemetry (for example).
https://github.com/keyval-dev/odigos/blob/main/DESTINATIONS....
rapidlua | 3 years ago
https://github.com/keyval-dev/opentelemetry-go-instrumentati...
Bayart | 3 years ago
My "gut instinct" would be to export that to Jaeger, but I'm open to suggestions as to better alternatives. We're on GCP so it might be an opportunity to try Google Cloud Trace as well.
decisionSniper | 3 years ago
eBPF for the win; this is a nice approach with Odigos.
theptip | 3 years ago
How does this compare with Cilium? It looks like they do OTel tracing (https://github.com/cilium/hubble-otel), but it's not native/core. Is that the main distinction?