xinweihe | 6 months ago

Thanks for the follow-up. Let me try to clarify!

When we say we "organize all logs, metrics, and traces", we mean more than just linking them together (which otel already supports). What we’re doing is:

- Context engineering optimization: We leverage the structure among logs, spans, and metadata to filter and group relevant context before passing it to the LLM. In real production issues, it's common to see 10k+ logs, traces, etc. related to a single incident, but most of that is noise. Throwing all of it at agents usually leads to poor performance due to context bloat (see https://arxiv.org/pdf/2307.03172). We're addressing that with structured filtering and summarization. For more details, see https://bit.ly/45Bai1q.

- Human-in-the-loop UI: For cases where developers want to manually inspect or guide the agent, we provide a UI that makes it easy to zoom in on relevant subtrees, trace paths, or log clusters and directly select the spans the agent should reason over.
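
To make the structured filtering idea concrete, here is a minimal sketch. It is not TraceRoot's actual implementation: the dict shapes for spans and logs and the function name `filter_context` are assumptions made for illustration. It keeps only the spans on a path from the root to an error log and drops low-severity logs before anything would reach the LLM.

```python
from collections import defaultdict

def filter_context(spans, logs):
    """Keep spans on a root-to-error path, with only their WARN/ERROR logs.

    `spans` and `logs` are hypothetical dict shapes, not a real schema.
    """
    parent = {s["id"]: s["parent"] for s in spans}
    logs_by_span = defaultdict(list)
    for log in logs:
        logs_by_span[log["span_id"]].append(log)

    # Spans that directly carry an error log.
    error_spans = {log["span_id"] for log in logs if log["level"] == "ERROR"}

    # Walk up from each error span to the root to collect the relevant path.
    keep = set()
    for span_id in error_spans:
        while span_id is not None and span_id not in keep:
            keep.add(span_id)
            span_id = parent.get(span_id)

    # Group surviving logs by span, dropping low-severity noise.
    return {
        s["id"]: [
            log["msg"]
            for log in logs_by_span[s["id"]]
            if log["level"] in ("WARN", "ERROR")
        ]
        for s in spans
        if s["id"] in keep
    }

spans = [
    {"id": "root", "parent": None},
    {"id": "auth", "parent": "root"},
    {"id": "db", "parent": "root"},
    {"id": "query", "parent": "db"},
]
logs = [
    {"span_id": "root", "level": "INFO", "msg": "request start"},
    {"span_id": "auth", "level": "INFO", "msg": "token ok"},
    {"span_id": "db", "level": "WARN", "msg": "pool nearly exhausted"},
    {"span_id": "query", "level": "ERROR", "msg": "timeout after 30s"},
    {"span_id": "query", "level": "DEBUG", "msg": "sql text"},
]
result = filter_context(spans, logs)
```

In this toy incident the `auth` subtree and all INFO/DEBUG lines fall away, leaving just the `root` -> `db` -> `query` path and its two relevant messages.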

The goal isn't just unification, it's scalable reasoning over noisy telemetry data, both automated and interactive.

Hope that clears things up a bit! Happy to dive deeper if useful.

lmeyerov|6 months ago

The second link helps

I wonder whether 80% of the question answering could be achieved with a prompts/otel.md over MCPs connected to Claude Code, letting agentic reasoning do the rest.

Ex:

* When investigating errors, only query for error-level logs

* When investigating performance, only query spans (skip logs unless required) and keep only name, time. Linearize as ... .

* When querying both logs & traces, inline logs near the relevant trace as part of an LLM-friendly stored artifact, e.g. jobs/abc123/context.txt
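
A rough sketch of what the last rule could produce, assuming hypothetical span and log dicts and an indented text layout that is my own choice (the linearization format above is left unspecified): each span becomes one line with its name and time, and its logs are inlined directly beneath it.

```python
def render_context(spans, logs):
    """Render a span tree as indented text with logs inlined under each span.

    The input shapes and the output layout are illustrative assumptions.
    """
    children = {}
    for span in spans:
        children.setdefault(span["parent"], []).append(span)
    logs_by_span = {}
    for log in logs:
        logs_by_span.setdefault(log["span_id"], []).append(log)

    lines = []

    def walk(span, depth):
        indent = "  " * depth
        # Keep only name and time for each span.
        lines.append(f"{indent}{span['name']} {span['ms']}ms")
        # Inline this span's logs right below it.
        for log in logs_by_span.get(span["id"], []):
            lines.append(f"{indent}  [{log['level']}] {log['msg']}")
        for child in children.get(span["id"], []):
            walk(child, depth + 1)

    for root in children.get(None, []):
        walk(root, 0)
    return "\n".join(lines)

spans = [
    {"id": 1, "parent": None, "name": "GET /checkout", "ms": 950},
    {"id": 2, "parent": 1, "name": "db.query", "ms": 900},
]
logs = [{"span_id": 2, "level": "ERROR", "msg": "timeout"}]
text = render_context(spans, logs)
```

The resulting string could then be written to an artifact file such as the jobs/abc123/context.txt path mentioned above, so the agent reads one compact file instead of issuing many tool calls.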

Are there aspects of the question answering (not ui widgets) you think are too hard there?

zecheng|6 months ago

Yes, we can connect, for example, CC with MCPs. But this may not work well if, say, a user wants to check the latency associated with the previous 10 days of error logs for function A. With MCP, the agent first needs to fetch 10 days of error logs, then somehow get the latency, correlate the two, and apply filters for function A. IMO it will hallucinate a lot if there are too many tools, logs, and traces. On the TraceRoot platform, we "mix" all the necessary data first and then apply filters on the structured data based on the user's query, which is more accurate, straightforward, and efficient. Here is the README of the general design: https://github.com/traceroot-ai/traceroot/tree/main/rest/age...
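
As a toy illustration of the "mix first, then filter" idea for the function A example (a sketch only, not the TraceRoot implementation; the field names and `error_latencies` are hypothetical), logs are joined to their spans once, and the query then becomes a plain filter over structured rows:

```python
from datetime import datetime, timedelta

def error_latencies(spans, logs, function, now, days=10):
    """Latencies of spans in `function` that logged an error in the last `days` days."""
    span_by_id = {s["id"]: s for s in spans}

    # "Mix" first: attach each log to the span it belongs to.
    joined = [
        {**log, "span": span_by_id[log["span_id"]]}
        for log in logs
        if log["span_id"] in span_by_id
    ]

    # Then answer the query with simple filters on the structured rows.
    cutoff = now - timedelta(days=days)
    return [
        row["span"]["duration_ms"]
        for row in joined
        if row["level"] == "ERROR"
        and row["span"]["function"] == function
        and row["time"] >= cutoff
    ]

now = datetime(2025, 8, 1)
spans = [
    {"id": "s1", "function": "A", "duration_ms": 120},
    {"id": "s2", "function": "A", "duration_ms": 480},
    {"id": "s3", "function": "B", "duration_ms": 900},
    {"id": "s4", "function": "A", "duration_ms": 300},
    {"id": "s5", "function": "A", "duration_ms": 200},
]
logs = [
    {"span_id": "s1", "level": "ERROR", "time": now - timedelta(days=2)},
    {"span_id": "s2", "level": "ERROR", "time": now - timedelta(days=12)},  # too old
    {"span_id": "s3", "level": "ERROR", "time": now - timedelta(days=1)},   # other function
    {"span_id": "s4", "level": "INFO", "time": now - timedelta(days=1)},    # not an error
    {"span_id": "s5", "level": "ERROR", "time": now - timedelta(days=5)},
]
latencies = error_latencies(spans, logs, function="A", now=now)
```

Because the correlation happens in code rather than across several agent tool calls, the model only has to pick the filter values (function, level, time window) instead of stitching raw logs and traces together itself.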