xinweihe | 7 months ago
In TraceRoot, we organize logs, metrics, and other telemetry around traces and build an execution tree. This structured view makes it much easier for our agent to reason over large volumes of telemetry data using context-aware optimizations. (We plan to support Slack and Notion integrations as well.)
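To make the execution tree idea a bit more concrete, here's a minimal sketch of attaching logs to spans and nesting spans by parent. This is not TraceRoot's actual code: the `SpanNode` class, the `build_tree` and `render` helpers, and the dict field names (`span_id`, `parent_id`, `name`, `message`) are all assumptions made up for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class SpanNode:
    # Hypothetical node type: one span plus the logs emitted inside it.
    span_id: str
    name: str
    parent_id: Optional[str] = None
    logs: List[str] = field(default_factory=list)
    children: List["SpanNode"] = field(default_factory=list)


def build_tree(spans: List[dict], logs: List[dict]) -> List[SpanNode]:
    """Attach each log to its span, then link spans to their parents."""
    nodes: Dict[str, SpanNode] = {
        s["span_id"]: SpanNode(s["span_id"], s["name"], s.get("parent_id"))
        for s in spans
    }
    for log in logs:
        node = nodes.get(log["span_id"])
        if node is not None:
            node.logs.append(log["message"])
    roots: List[SpanNode] = []
    for node in nodes.values():
        parent = nodes.get(node.parent_id) if node.parent_id else None
        if parent is not None:
            parent.children.append(node)
        else:
            # No known parent: treat the span as a root of the tree.
            roots.append(node)
    return roots


def render(node: SpanNode, depth: int = 0) -> List[str]:
    """Flatten the tree into indented lines, logs listed under their span."""
    lines = ["  " * depth + node.name]
    lines += ["  " * (depth + 1) + "log: " + msg for msg in node.logs]
    for child in node.children:
        lines += render(child, depth + 1)
    return lines


# Made-up example data, not real telemetry.
spans = [
    {"span_id": "a", "parent_id": None, "name": "GET /checkout"},
    {"span_id": "b", "parent_id": "a", "name": "charge_card"},
    {"span_id": "c", "parent_id": "a", "name": "send_receipt"},
]
logs = [
    {"span_id": "b", "message": "card declined"},
    {"span_id": "a", "message": "request failed"},
]
roots = build_tree(spans, logs)
lines = render(roots[0])
print("\n".join(lines))
```

The point of the indented rendering is that each log line shows up next to the span that produced it, so whoever reads it (a person or an LLM) sees the call structure and the messages together.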
It’s not a one-off tool. TraceRoot is a live monitoring platform. It continuously watches what’s happening in prod. So when something breaks, the agent already has full team-visible context, not just a scratchpad session spun up in the moment.
Down the line, we're aiming for automatic bug detection and remediation - not just smarter copiloting, but proactive debugging workflows. The system also retains team-level memory of past bugs, fixes, and infra quirks, so the agent gets smarter over time.
We’ve open sourced a lot of the core. Would love to jam on this if you're up for it. Always down to trade ideas or even hack on something together!
lmeyerov|6 months ago
xinweihe|6 months ago
When we say we "organize all logs, metrics, and traces", we mean more than just linking them together (which OpenTelemetry already supports). What we’re doing is:
- Context engineering optimization: We use the structure among logs, spans, and metadata to filter and group relevant context before passing it to the LLM. In real production issues, it's common to see 10k+ logs, traces, etc. related to a single incident, but most of it is noise. Throwing all of that at agents usually leads to poor performance due to context bloat (see https://arxiv.org/pdf/2307.03172). We're addressing that with structured filtering and summarization. For more details, see https://bit.ly/45Bai1q.
- Human-in-the-Loop UI: For cases where developers want to manually inspect or guide the agent, we provide a UI that makes it easy to zoom in on relevant subtrees, trace paths, or log clusters and directly select spans to include in the agent's reasoning.
The goal isn't just unification, it's scalable reasoning over noisy telemetry data, both automated and interactive.
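As a rough illustration of what structured filtering and summarization could look like, here's a small sketch that keeps only logs from spans on a path to an error and collapses repeated messages into counted templates. It's one possible heuristic under stated assumptions, not a description of how TraceRoot does it: the `status` field, the helper names `error_path_span_ids` and `summarize_logs`, and the digit-masking rule are all invented for the example.

```python
import re
from collections import Counter
from typing import List, Set


def error_path_span_ids(spans: List[dict]) -> Set[str]:
    """Return ids of spans marked as errors plus all of their ancestors."""
    parent = {s["span_id"]: s.get("parent_id") for s in spans}
    keep: Set[str] = set()
    for s in spans:
        if s.get("status") == "error":
            cur = s["span_id"]
            # Walk up to the root, stopping early on already-visited spans.
            while cur is not None and cur not in keep:
                keep.add(cur)
                cur = parent.get(cur)
    return keep


def summarize_logs(logs: List[dict], keep: Set[str]) -> List[str]:
    """Drop logs outside `keep` and merge messages that differ only in digits."""
    counts: Counter = Counter()
    for log in logs:
        if log["span_id"] in keep:
            template = re.sub(r"\d+", "<n>", log["message"])
            counts[(log["span_id"], template)] += 1
    return [f"[{sid}] {tpl} (x{n})" for (sid, tpl), n in counts.items()]


# Made-up example data, not real telemetry.
spans = [
    {"span_id": "root", "parent_id": None, "status": "ok"},
    {"span_id": "checkout", "parent_id": "root", "status": "ok"},
    {"span_id": "pay", "parent_id": "checkout", "status": "error"},
    {"span_id": "search", "parent_id": "root", "status": "ok"},
]
logs = [
    {"span_id": "search", "message": "cache hit 1"},
    {"span_id": "checkout", "message": "start checkout"},
    {"span_id": "search", "message": "cache hit 2"},
    {"span_id": "pay", "message": "retry 1 failed"},
    {"span_id": "pay", "message": "retry 2 failed"},
    {"span_id": "pay", "message": "retry 3 failed"},
]
keep = error_path_span_ids(spans)
summary = summarize_logs(logs, keep)
print("\n".join(summary))
```

Here six raw log lines shrink to two summary lines, and the unrelated `search` span is dropped entirely. In a real incident the same idea would be applied to thousands of lines before anything reaches the model's context.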
Hope that clears things up a bit! Happy to dive deeper if useful.