top | item 38071467

edenfed | 2 years ago

Thanks for the valuable feedback! We used a constant load of 10,000 rps throughout. The exact testing setup can be found under “how we tested”.

I think the example you gave of the lock used by the Prometheus library is a great illustration of why generating traces/metrics is a good fit for offloading to a different process (an agent).

Pachyderm looks very interesting; however, I am not sure how you can generate distributed traces based on metrics. How do you fill in the missing context propagation?

Our way of dealing with eBPF's root requirements is to be as transparent as possible. This is why we donated the code to the CNCF and are developing it as part of the OpenTelemetry community. We hope that being open will earn users' trust. You can see the relevant code here: https://github.com/open-telemetry/opentelemetry-go-instrumen...


jrockway | 2 years ago

> I am not sure how you can generate distributed traces based on metrics

Every log line gets an x-request-id field, and when you combine the logs from the various components, you can see the propagation throughout our system. The request ID is a UUIDv4, but the mandatory version nibble (the "4") gets replaced with a digit that represents where the request came from: background task, web UI, CLI, etc. I didn't take the approach of creating a separate span ID to show sub-requests. Since you have all the logs, this extra piece of information isn't strictly necessary, though my coworkers have asked for it a few times because every other system has it.
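The scheme above can be sketched in Go: generate 16 random bytes as for a UUIDv4, but write an origin code into the nibble that would normally hold the version. The origin constants and function name here are illustrative, not the actual Pachyderm values.

```go
package main

import (
	"crypto/rand"
	"fmt"
)

// Hypothetical origin codes; the real system's digit assignments
// may differ.
const (
	originBackground = 0x1
	originWebUI      = 0x2
	originCLI        = 0x3
)

// newRequestID returns a random 128-bit ID formatted like a UUID,
// except the nibble that normally holds the UUIDv4 version (the
// first hex digit of the third group) carries an origin code.
func newRequestID(origin byte) (string, error) {
	var b [16]byte
	if _, err := rand.Read(b[:]); err != nil {
		return "", err
	}
	// A standard UUIDv4 would set b[6] = 0x40 | (b[6] & 0x0f);
	// here the high nibble encodes where the request came from.
	b[6] = (origin << 4) | (b[6] & 0x0f)
	// Keep the RFC 4122 variant bits as usual.
	b[8] = 0x80 | (b[8] & 0x3f)
	return fmt.Sprintf("%x-%x-%x-%x-%x",
		b[0:4], b[4:6], b[6:8], b[8:10], b[10:16]), nil
}

func main() {
	id, err := newRequestID(originCLI)
	if err != nil {
		panic(err)
	}
	fmt.Println(id)
	// id[14] is where the UUID version digit normally sits.
	fmt.Println("origin digit:", string(id[14]))
}
```

The tradeoff is that such IDs are no longer valid UUIDv4s to a strict parser, but any tool that treats them as opaque strings works unchanged.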

Since metrics are also log lines, they get the request-id, so you can do really neat things like "show me when this particular download stalled" or "show me how much bandwidth we're using from the upstream S3 server". The aggregations can take place after the fact, since you have all the raw data in the logs.
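Aggregating after the fact can look something like the following Go sketch: scan structured log lines, group a raw metric by x-request-id, and sum it. The JSON field names and schema are assumptions for illustration, not the actual log format.

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"strings"
)

// logLine is a hypothetical shape for a structured log entry.
type logLine struct {
	RequestID string `json:"x-request-id"`
	Msg       string `json:"msg"`
	Bytes     int64  `json:"bytes"`
}

// bytesPerRequest derives a metric (bytes transferred per request)
// from raw log lines after the fact, keyed by request ID.
func bytesPerRequest(logs string) map[string]int64 {
	totals := make(map[string]int64)
	sc := bufio.NewScanner(strings.NewReader(logs))
	for sc.Scan() {
		var ll logLine
		if err := json.Unmarshal(sc.Bytes(), &ll); err != nil {
			continue // skip unstructured lines
		}
		totals[ll.RequestID] += ll.Bytes
	}
	return totals
}

func main() {
	logs := `{"x-request-id":"3f1c","msg":"s3 read","bytes":1024}
{"x-request-id":"3f1c","msg":"s3 read","bytes":2048}
{"x-request-id":"9ab0","msg":"s3 read","bytes":512}`
	fmt.Println(bytesPerRequest(logs)["3f1c"]) // 3072
}
```

Because the grouping key is kept per line rather than pre-aggregated, any question ("which request stalled?", "how much upstream bandwidth?") can be answered later without having decided on label cardinality up front.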

If we were tailing the logs and sending things to Jaeger/Prometheus, a lot of this data would have to go away for cardinality reasons. But squirreling the logs away safely, and then doing analysis after the fact when a problem is suspected, ends up being pretty workable. (We do still have a Prometheus exporter not based on the logs, for customers who want alerts. For log storage, we bundle Loki.)