(no title)
the_duke|1 month ago
From the post I expected that the tasks were about analysing traces, but all the tasks in the repository are about adding instrumentation to code!
Some of the instructions don't give any guidance on how to do it; others specify which libraries to use.
"Use standard OTEL patterns" ... that's about as useful as saying "go write some code". There are a lot of ways to do instrumentation....
I'd be very curious HOW exactly the models fail.
Are the test sets just incredibly specific about what output they expect, and you get a lot of failures because of tiny subtle mismatches? Or do they just get the instrumentation categorically wrong?
Also important: do the models have access to a web search tool to read the library docs? OTEL libraries are often complicated to use... without reading the latest docs or source code this would be quite tricky.
Some models have gotten better at adding dependencies, installing them and then reading the code from the respective directory where dependencies get stored, but many don't do well with this.
All in all, I'm very skeptical that this is useful as a benchmark as-is.
I'd be much more interested in tasks like:
Here are trace/log outputs, here is the source code; find and fix the bug.
sathish316|1 month ago
For AI-SRE tasks like finding the root cause of bugs and errors, I believe the key is to give the agent tools to query metrics, logs, and traces and understand the problem. I'm working on a similar OSS framework and benchmark (work in progress, using metrics and logs; demo: https://youtube.com/playlist?list=PLKWJ03cHcPr3Od1rwL7ErHW1p...). The context is a Semantic/Text2SQL layer that queries the right metrics and logs, and the benchmark is a set of Skills that Claude Code or other agents can run with these tools to find the root cause of errors:
Codd Semantic/Text2SQL engine: https://github.com/sathish316/codd_query_engine
PreCogs skills and simulated scenarios: https://github.com/sathish316/precogs_sre_oncall_skills
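Roughly, the loop looks like this (sqlite and the logs schema are illustrative stand-ins here, not the actual Codd API):

    import sqlite3

    def translate(question: str) -> str:
        # Stand-in for the semantic/Text2SQL step; the real engine maps the
        # question onto known metric/log schemas before emitting SQL.
        return ("SELECT service, COUNT(*) AS errors FROM logs "
                "WHERE level = 'ERROR' AND ts > datetime('now', '-1 hour') "
                "GROUP BY service ORDER BY errors DESC LIMIT 5")

    conn = sqlite3.connect("observability.db")
    question = "Which services are erroring the most in the last hour?"
    for row in conn.execute(translate(question)):
        print(row)  # the agent reads these rows to narrow down the root cause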
rixed|1 month ago
An SRE's job is to make the software reliable, for instance by adding telemetry, and by understanding and improving the failure modes, the behavior under load, etc.
So a better SRE test would not be "read the logs and fix the bug", but rather "read the code and identify potential issues".
pixl97|1 month ago
Supporting a piece of cloud software with a lot of microservices, I think this is a more general problem for humans too. The app I work with mandated some logging requirements, like which library to use, but that was it: different parts built by different teams ended up with all kinds of different behaviors.
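A made-up but representative example: both teams use the mandated stdlib logging module, yet the records that land in the aggregator share nothing:

    import json, logging

    # Team A: plain-text formatter
    checkout = logging.getLogger("checkout")
    text_handler = logging.StreamHandler()
    text_handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    checkout.addHandler(text_handler)

    # Team B: hand-rolled JSON lines with its own field names
    inventory = logging.getLogger("inventory")
    inventory.addHandler(logging.StreamHandler())

    checkout.warning("payment retry for order 123")
    inventory.warning(json.dumps({"lvl": "WARN", "msg": "stock low", "sku": "A-9"}))

Same mandated library, two formats that no downstream parser can correlate.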
As for the AI side, this is something where I see our limited context sizes causing issues when developing architecture across multiple products.
bob1029|1 month ago
Context size isn't the issue. You couldn't effectively leverage an infinite context even if you had one. The general solution is to recursively decompose the problem into smaller ones and solve them independently of each other, returning the results back up the stack. Recursion is the key here. A bunch of parallel agents on separate call stacks that don't block on their logical callees is a slop factory.
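A toy sketch of what I mean, where decompose/solve_leaf/combine are hypothetical agent calls rather than any particular framework:

    def solve(task, agent):
        subtasks = agent.decompose(task)   # returns [] when the task is small enough
        if not subtasks:
            return agent.solve_leaf(task)  # leaf work fits in one context window
        # Depth-first and blocking: each callee finishes before the next starts,
        # and the caller only ever sees the combined summaries, not raw context.
        results = [solve(sub, agent) for sub in subtasks]
        return agent.combine(task, results)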
ambicapter|1 month ago
People say to put things like "Use best practices" in your prompts all the time, and chide people who don't.
ndriscoll|1 month ago
This is no different from writing a style guide for your team/org. You don't just say "write clean code" and expect that you'll get something you like.
noitpmeder|1 month ago
Like the adjacent commenters, I've tried to get better at enumerating what I consider best practice, but I couldn't argue in good faith that instructions like these produce no noticeable improvement.
(As with all things AI, it could all be perception on my end, so YMMV. I wish there were a better way to concretely evaluate the effects of different rule sets / instructions / ... on outcomes.)
esseph|1 month ago
Integrating OTEL into an application stack requires explicit knowledge of the code, which means involving the developers.
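A hedged sketch of why, reusing the OTEL Python tracing API (the business logic here is invented): auto-instrumentation can wrap the HTTP layer generically, but only the people who wrote the code know which domain attributes make a trace useful:

    from opentelemetry import trace

    tracer = trace.get_tracer(__name__)

    def apply_discount(cart, coupon):
        with tracer.start_as_current_span("apply_discount") as span:
            # No generic tool can infer that these are the fields that matter
            # when this code path misbehaves; that's developer knowledge.
            span.set_attribute("cart.item_count", len(cart.items))
            span.set_attribute("coupon.code", coupon.code)
            span.set_attribute("coupon.stacks_with_sale", coupon.stacks_with_sale)
            ...  # pricing logic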