top | item 40682695

(no title)

With only sampled traces though it’s very hard to understand the impact of the problem. There are some bad traces but is it affecting 5%, 10% or 90% of your customers. Metrics shine there.

discuss

remram|1 year ago

Whether it is affecting 5% or 10% of your customers, if it is erroring at that rate you are going to want to find the root cause ASAP. Traces let you do that, whereas the precise number does nothing. I am a big supporter of metrics but I don't see this as the use case at all.

Jemaclus|1 year ago

(not your OP) This is true, but I find that metrics are useful whether something is going wrong or not (metrics that show 100% success are useful in determining baselines and what "normal" is), whereas collecting traces _when nothing is going wrong_ is not useful -- it's just taking up space and ingress, and thus costing me money.

My typical approach in the past has been to use metrics to determine when something is going wrong, then enable either tracing or logs (usually logs) to determine exactly what is breaking. For a dev or team that is highly connected to their software, simply knowing what was recently released is enough to zero in on problems without relying upon tracing.

Traces can be useful, but they're expensive relative to metrics, even if sampled at a very low rate.

pdimitar|1 year ago

Strange example, you'd think you want to fix this as quickly as humanly possible, no?

Also we don't sample traces, it's a fire hose of data aimed at the OTel collector. We do archive them / move them to colder and cheaper storage after a little time though, and we found that a viable money-saving strategy and a good balance overall.

aserafini|1 year ago

Not all problems result in error traces to analyse.

Example, you release buggy client that doesn't call "POST /order/finalize" when it should.

There are no error traces, there are just missing HTTP requests. Metrics reveal that calls to "POST /order/finalize" for iOS apps are down 50% WoW.