Launch HN: Odigos (YC W23) – Instant distributed tracing for Kubernetes clusters
Our earlier experiences with monitoring tools were frustrating. Monitoring a distributed system with multiple microservices, we found ourselves spending far too much time trying to locate the specific microservice at the root of a problem. For example, we once spent hours debugging an application we suspected was causing high latency, only to find that the actual problem was rooted in a completely different application.
Then we learned about distributed tracing, which solves exactly this problem. Unlike metrics or logs, which capture a data point at a single moment in a single application, a distributed trace follows a request as it propagates through a distributed environment by tagging it with a unique ID. This allows developers to understand the context of each request and how their distributed applications actually behave.
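In practice, that unique ID travels between services in request metadata. With OpenTelemetry (which Odigos builds on), that's the W3C Trace Context header; a representative value looks like this (the IDs below are illustrative):

    traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01

The four dash-separated fields are the format version, the 16-byte trace ID shared by every span in the request, the 8-byte ID of the parent span, and trace flags (e.g. whether the request is sampled).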
The downside is that it is difficult to implement. Unlike metrics or logs, distributed tracing only delivers value once it is implemented across multiple applications. If even one of your applications does not produce trace data, context propagation is broken and the value of the traces drops significantly.
We manually implemented distributed tracing for multiple companies, but coordinating all the development teams to instrument their applications for a complete distributed trace was a challenge. Once the implementation was finished, we saw great value and fixed production issues much faster. But a partial implementation wasn't worth much.
We set out to automate this process. We knew how to do most of it, but the trickiest part was automatically instrumenting programs written in compiled languages (like Go). If we could do that, we would be able to automate the entire process of generating distributed traces. While researching, we realized that eBPF, a technology that allows the Linux kernel to load external programs for execution within the kernel, could be used to develop automatic instrumentation for compiled languages. That was the final piece of the puzzle, and with it we were able to develop Odigos.
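To give a feel for the mechanism, here is a minimal sketch using the open-source cilium/ebpf library (not our production code; the object file, symbol name, and PID are illustrative) of attaching a uprobe to a function inside a running Go binary:

    package main

    import (
        "log"

        "github.com/cilium/ebpf"
        "github.com/cilium/ebpf/link"
    )

    func main() {
        // Load a pre-compiled eBPF object file (assumed to contain a
        // program named "trace_http"). Requires root privileges.
        coll, err := ebpf.LoadCollection("probe.o")
        if err != nil {
            log.Fatalf("loading eBPF collection: %v", err)
        }
        defer coll.Close()

        // Open the target process's binary so we can resolve symbols in it.
        ex, err := link.OpenExecutable("/proc/1234/exe")
        if err != nil {
            log.Fatalf("opening executable: %v", err)
        }

        // Attach the probe at the entry of the HTTP dispatch function;
        // the kernel now runs it on every call, with no app changes.
        up, err := ex.Uprobe("net/http.(*ServeMux).ServeHTTP",
            coll.Programs["trace_http"], nil)
        if err != nil {
            log.Fatalf("attaching uprobe: %v", err)
        }
        defer up.Close()

        select {} // keep the probe attached while events stream
    }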
Odigos first scans and recognizes all your running applications, then detects the programming language of each one and auto-instruments it accordingly, using eBPF and OpenTelemetry. In addition, it deploys collectors that buffer, filter, and deliver data to your chosen monitoring tool, and auto-scales them according to the amount of traffic. This automation lets developers enjoy distributed traces within minutes, as opposed to manual instrumentation, which can take months.
Automatic instrumentation across programming languages is not a trivial task, especially when dealing with static binaries (like the ones produced by the Go compiler). We built multiple mechanisms to make sure we inject the relevant headers in a secure and stable way. We developed a system that tracks functions and structs across different versions of open-source libraries. In addition, we developed a system that performs userspace memory management in eBPF. As a result, Odigos is the only solution that is able to automatically generate distributed traces for compiled languages like Go and Rust. While other solutions require users to be experts in OpenTelemetry or eBPF, our solution does not require prior knowledge of observability technologies.
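As a toy illustration of the version-tracking idea (the names and numbers below are invented for the example, not our actual offset data), the probes consult a table mapping each library version to the struct-field offsets they need to read:

    package main

    import "fmt"

    // fieldOffsets maps "library@version" to the byte offset of each
    // struct field a probe reads; offsets shift between releases, so
    // the table is keyed by version.
    var fieldOffsets = map[string]map[string]uint64{
        "net/http@go1.19": {"Request.Method": 0, "Request.URL": 16},
        "net/http@go1.20": {"Request.Method": 0, "Request.URL": 16},
    }

    func offsetFor(libVersion, field string) (uint64, bool) {
        fields, ok := fieldOffsets[libVersion]
        if !ok {
            return 0, false
        }
        off, ok := fields[field]
        return off, ok
    }

    func main() {
        if off, ok := offsetFor("net/http@go1.20", "Request.URL"); ok {
            fmt.Printf("read Request.URL at offset %d\n", off)
        }
    }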
Our solution can be installed on any Kubernetes cluster by executing a single command. Once installed, we detect the programming language of every running application and apply the relevant instrumentation. For JIT languages (Java and .NET) or interpreted languages (JavaScript and Python) we deploy OpenTelemetry instrumentation. For compiled languages (Go, Rust, C) we deploy our eBPF-based instrumentation. All of this is abstracted from the user, who only has to: (1) select any or all of their target applications and (2) select a backend to send monitoring data to.
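For reference, the flow looks roughly like this (assuming the odigos CLI from our docs is installed; exact commands may change between releases):

    odigos install   # deploys Odigos into the current kubectl context
    odigos ui        # opens the local UI to pick apps and a destination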
In May 2022, we released our first open-source project: automatic instrumentation for Go applications, based on eBPF. We later donated this project to the OpenTelemetry community and it is currently being developed as part of the Go Automatic Instrumentation SIG.
We are big believers in open standards; therefore, the instrumentation and collectors used by Odigos are all based on open-source projects developed by the OpenTelemetry community. This also enables us to be vendor-agnostic.
Currently we are focused on building our open-source project. There are no pricing or paid features yet, but in the future, we are planning to offer a managed version of Odigos that will include enterprise features.
If you're interested in learning more, check out our docs (https://docs.odigos.io), watch a demo video (https://www.youtube.com/watch?v=9d36AmVtuGU), and visit our website (https://odigos.io).
We’d love to hear your experiences with tracing and monitoring distributed applications and anything else you’d like to share!
cube2222 | 3 years ago
> Automatic instrumentation across programming languages is not a trivial task, especially when dealing with static binaries (like the ones produced by the Go compiler). We built multiple mechanisms to make sure we inject the relevant headers in a secure and stable way. We developed a system that tracks functions and structs across different versions of open-source libraries.
Could be very useful for non-greenfield projects. I'd love to learn more about the details; is there a writeup somewhere?
Though I'd still recommend new projects do "proper" tracing, with not only one span per service but also spans for important functions and additional application-specific tags, as that is easily 10x the value.
But since life is a sequence of tradeoffs, I think this project could be really useful in a lot of places.
phillipcarter | 3 years ago
FWIW Odigos makes this possible because it uses OpenTelemetry (and generates OTel-compatible instrumentation for the eBPF-sourced data). You can go into an app that's instrumented this way, add an OpenTelemetry SDK, and start writing manual instrumentation or include additional instrumentation libraries. Your traces will just get deeper/richer when you do that.
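For example, in Go a manual span added on top of the automatic ones looks roughly like this (a minimal sketch; the tracer name, span name, and attribute are illustrative, and a real app would also register a TracerProvider so spans get exported):

    package main

    import (
        "context"

        "go.opentelemetry.io/otel"
        "go.opentelemetry.io/otel/attribute"
    )

    func processOrder(ctx context.Context, orderID string) {
        // Starts a child span of whatever span is already in ctx,
        // so it nests under the automatically generated trace.
        ctx, span := otel.Tracer("checkout").Start(ctx, "processOrder")
        defer span.End()

        span.SetAttributes(attribute.String("order.id", orderID))
        _ = ctx // pass ctx onward so further child spans nest correctly
    }

    func main() {
        processOrder(context.Background(), "ord-123")
    }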
edenfed | 3 years ago
In addition, we automatically create spans for popular open-source libraries in use, so you should also expect to see spans for database connections / cloud SDKs / Kafka clients / etc. I definitely agree that manual instrumentation is very important in addition to the automatic kind.
intelVISA | 3 years ago
Still, always glad to see some innovation in this space.
mdaniel | 3 years ago
For an unrelated reason, today I was reminded about Pixie (https://news.ycombinator.com/item?id=25375170 and https://news.ycombinator.com/item?id=31687978 and https://github.com/pixie-io/pixie#readme), which describes itself as an eBPF Kubernetes observability tool, and is also Apache-licensed.
I suspect the difference may be your aspiration to move beyond just Kubernetes, but I wondered if that's the biggest difference between your project and theirs? Or maybe C++ versus Go?
thorgaardian | 3 years ago
I was digging through the docs and it looks like you have custom language detection. Did you consider extracting the language-detection features from buildpacks to do this? I imagine you'd get more reliable results and have less to maintain if you used that as the basis.
yashap | 3 years ago
I'd imagine the challenge here is the long tail of tracing and metric needs. I'm thinking things like:
- For the JVM, do you support things like thread pools and execution contexts well? E.g. if part of serving a response to an HTTP request means executing some async work against an execution context, does the context propagation work properly? And if so, would this work for other JVM languages, like Scala, or just Java? When I've manually instrumented apps for context propagation, it's been easy for languages like JS (Node) and PHP, but hard for languages like Scala, where people use so many different concurrency models.
- Some units of work/tracing are pretty standardized, like serving a response to an HTTP request. But others are less so, for example work triggered by job queues/events, where a message on some sort of Kafka/Redis/Postgres/whatever queue triggers your app to do some work (instead of an HTTP request). I have trouble seeing how Odigos would instrument this well. E.g. even if you detect the work, how do you label related metrics well (you can't just rely on HTTP method/path)? How do you measure success/failure of the job? Or if you don't try to tackle this sort of use case, would there be something like Odigos libs for manual instrumentation where necessary? (For context, a sketch of the manual version follows below.)
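When I've done this manually with OpenTelemetry, it meant shuttling the trace context through the message headers myself, roughly like this Go sketch (names are illustrative):

    package main

    import (
        "context"

        "go.opentelemetry.io/otel"
        "go.opentelemetry.io/otel/propagation"
    )

    // Producer: serialize the current trace context into headers that
    // ride along with the queue message (e.g. Kafka record headers).
    func injectHeaders(ctx context.Context) map[string]string {
        carrier := propagation.MapCarrier{}
        otel.GetTextMapPropagator().Inject(ctx, carrier)
        return carrier
    }

    // Consumer: restore the context from the headers and start a span
    // for the job, so it joins the original trace instead of a new one.
    func handleJob(headers map[string]string) {
        ctx := otel.GetTextMapPropagator().Extract(
            context.Background(), propagation.MapCarrier(headers))
        ctx, span := otel.Tracer("worker").Start(ctx, "process-job")
        defer span.End()
        _ = ctx // ... do the work ...
    }

    func main() {
        // Register the W3C propagator; the global default is a no-op.
        otel.SetTextMapPropagator(propagation.TraceContext{})
        handleJob(injectHeaders(context.Background()))
    }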
jedberg | 3 years ago
Doing it automatically is a huge win!
Congrats on the launch and I look forward to learning more!
jzelinskie | 3 years ago
I see you're injecting trace IDs into programs. How do you guarantee that this doesn't break the binary or flag any security/compliance requirements?
phillipcarter | 3 years ago
Agreed! I'm one of the maintainers of part of the project. What sorts of things are top of mind for you w.r.t. quality-of-life improvements?
Benjamin_Dobell | 3 years ago
If you can go beyond Kubernetes, I think that'd give Odigos more staying power. Naturally, some integrations are out of your hands, AWS Fargate being one (https://github.com/aws/containers-roadmap/issues/1027). However, if you could get integrations up and running with the likes of Fargate, Fly.io, Render.com, etc., that'd be amazing.
avinassh | 3 years ago
> A Datadog account with API key. Go to Datadog website to create a new free account. In addition, create a new API key by navigating to Organization settings, then click on API keys, and create a new key.
https://docs.odigos.io/prerequisites
tecleandor | 3 years ago
If I understood correctly, Odigos supports a bunch of observability backends, so instead of Datadog you could use Jaeger, Splunk, or OpenTelemetry (for example).
https://github.com/keyval-dev/odigos/blob/main/DESTINATIONS....
rapidlua | 3 years ago
https://github.com/keyval-dev/opentelemetry-go-instrumentati...
Bayart | 3 years ago
My "gut instinct" would be to export that to Jaeger, but I'm open to suggestions as to better alternatives. We're on GCP so it might be an opportunity to try Google Cloud Trace as well.
decisionSniper | 3 years ago
eBPF for the win; this is a nice approach with Odigos.
theptip | 3 years ago
How does this compare with Cilium? It looks like they do OTel tracing (https://github.com/cilium/hubble-otel), but it's not native/core. Is that the main distinction?