> We find testing and evals to be the hardest problem here. This is not entirely surprising, but the agentic nature makes it even harder. Unlike prompts, you cannot just do the evals in some external system because there’s too much you need to feed into it. This means you want to do evals based on observability data or instrumenting your actual test runs. So far none of the solutions we have tried have convinced us that they found the right approach here.

I'm curious about the solutions the OP has tried so far.
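For context, the pattern the quoted author describes, running evals over instrumented test runs rather than in a detached external system, can be sketched roughly like this. This is a minimal hand-rolled span recorder standing in for a real OTel exporter; all names (`SpanRecorder`, `run_agent`, the span attributes) are illustrative, not from any specific framework:

```python
# Sketch of "evals from instrumented runs": the agent code is traced,
# spans are recorded in-process, and assertions run over the recorded
# trace instead of over a detached prompt/response pair.
import contextlib
import time

class SpanRecorder:
    def __init__(self):
        self.spans = []

    @contextlib.contextmanager
    def span(self, name, **attrs):
        start = time.monotonic()
        try:
            yield attrs  # callers can attach more attributes mid-span
        finally:
            attrs["duration_s"] = time.monotonic() - start
            self.spans.append({"name": name, **attrs})

recorder = SpanRecorder()

def run_agent(task):
    # Stand-in for a real agent loop: plan, call a tool, answer.
    with recorder.span("plan", task=task):
        pass
    with recorder.span("tool_call", tool="search") as s:
        s["result_count"] = 3
    with recorder.span("answer") as s:
        s["text"] = "42"
    return "42"

# The "eval" asserts over the whole trace, not just the final output.
answer = run_agent("What is 6 * 7?")
tool_calls = [s for s in recorder.spans if s["name"] == "tool_call"]
assert answer == "42"
assert len(tool_calls) <= 2, "agent should not thrash on tool calls"
```

The point is that the trace carries the intermediate steps (tool choices, latencies, retries) that an external prompt-level eval never sees.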
hommes-r|3 months ago
In general, a more generic eval setup is needed, one that places minimal requirements on AI engineers, if we as a sector want to move beyond vibe-based reliability engineering practices.
radarsat1|3 months ago
In my case, I was until recently working on TTS, and this was a huge barrier for us. We used all the common signal-quality and MOS-prediction models that judge so-called "naturalness", "expressiveness", etc. But we found that none of these really helped us decide when one model was better than another, or when a model was "good enough" for release. Our internal evaluations correlated poorly with them, and we even disagreed quite a bit within the team on output quality. This made hyperparameter tuning, as well as commercial planning, extremely difficult, and we suffered greatly for it. (Note my use of the past tense here...)
Having good metrics is just really key, and I'm now at the point where I'd go so far as to say that if good metrics don't exist, it's almost not even worth working on something. (Almost.)
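One way to quantify "correlated poorly" is to compute the rank correlation between an automatic metric (e.g. a MOS predictor) and internal human ratings across checkpoints; if it is near zero, the metric cannot drive release decisions. A pure-stdlib Spearman sketch, with made-up illustrative scores:

```python
# Spearman rank correlation between an automatic metric and human
# ratings, per model checkpoint. Scores below are invented examples.

def ranks(xs):
    # Average ranks (1-based), handling ties.
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(xs, ys):
    # Pearson correlation computed over the ranks.
    rx, ry = ranks(xs), ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx) ** 0.5
    vy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (vx * vy)

mos_predictor = [4.1, 4.3, 3.9, 4.5, 4.2]   # automatic metric per checkpoint
human_rating  = [3.2, 4.4, 3.8, 3.1, 4.0]   # internal eval per checkpoint
print(round(spearman(mos_predictor, human_rating), 2))  # prints -0.1
```

A value that close to zero means the automatic metric is essentially uninformative about which checkpoint humans prefer.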
heljakka|3 months ago
We believe you need both to automatically create evaluation policies from OTEL data (data-first) and to bring in rigorous LLM-judge automation from the other end (intent-first) for the truly open-ended aspects.
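A toy sketch of what the "data-first" half could look like in the simplest case: derive candidate policies (allowed tool set, latency budget) from spans of known-good runs, then check new runs against them. Field names and the slack factor are illustrative, not a real OTel schema or any particular product's approach:

```python
# Derive simple eval policies from recorded trace data of good runs.
# good_runs: one list of span dicts per known-good run.
good_runs = [
    [{"tool": "search", "duration_s": 0.8}, {"tool": "fetch", "duration_s": 1.1}],
    [{"tool": "search", "duration_s": 0.6}],
]

def derive_policy(runs, slack=1.5):
    # Allowed tools = everything seen in good runs; latency budget =
    # worst observed span duration plus some slack.
    allowed = {s["tool"] for run in runs for s in run}
    worst = max(s["duration_s"] for run in runs for s in run)
    return {"allowed_tools": allowed, "max_span_s": worst * slack}

def check(run, policy):
    return all(
        s["tool"] in policy["allowed_tools"]
        and s["duration_s"] <= policy["max_span_s"]
        for s in run
    )

policy = derive_policy(good_runs)
# A run that invokes a tool never seen in good runs fails the policy.
print(check([{"tool": "shell", "duration_s": 0.2}], policy))  # False
```

The intent-first side (an LLM judge with a rubric) would cover the open-ended qualities these mechanical checks cannot express.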
verdverm|3 months ago
https://google.github.io/adk-docs/evaluate/
tl;dr - it's challenging because different runs produce different outputs, and it's unclear how to decide pass/fail (in practice, people use another LLM/agent as a judge).
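The pass/fail-by-judge pattern the tl;dr mentions can be sketched like this: sample the agent several times, grade each transcript against a rubric, and pass if enough runs clear a threshold. Here `call_judge` is a stub standing in for a real judge-model API call, and the rubric, threshold, and transcripts are illustrative:

```python
# LLM-as-judge pass/fail for nondeterministic agent output: run the
# agent N times, judge each transcript, pass on aggregate pass rate.
import json

RUBRIC = "Did the agent answer correctly and cite at least one source?"

def call_judge(rubric, transcript):
    # Stub: a real implementation would send rubric + transcript to a
    # judge model and parse its JSON verdict.
    ok = "source:" in transcript and "42" in transcript
    return json.dumps({"pass": ok, "reason": "stubbed verdict"})

def eval_run(transcripts, threshold=0.8):
    verdicts = [json.loads(call_judge(RUBRIC, t))["pass"] for t in transcripts]
    pass_rate = sum(verdicts) / len(verdicts)
    return pass_rate, pass_rate >= threshold

# Three sampled runs of the same task; outputs differ run to run.
runs = [
    "thought... answer: 42 source: arithmetic",
    "answer: 42 source: calculator tool",
    "answer: about 40",  # wrong / unsupported run
]
rate, passed = eval_run(runs, threshold=0.6)
print(rate, passed)  # 2/3 of runs pass the judge
```

Judging against a pass-rate over multiple samples, rather than a single run, is what absorbs the run-to-run variance the tl;dr points at; the judge itself is of course another source of noise that needs its own calibration.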