we spent a few months building evals for a health agent (and the agent itself!). tried to apply Anthropic's framework to a real system that looks at CGM (continuous glucose monitor) data + diet.
some of it worked. we got decent at checking form: citations exist, tools were called, numbers trace back. the harder part was essence: is this clinically appropriate? actually helpful? we didn't really solve that.
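to make "checking form" concrete, here's a minimal sketch of what checks like these could look like. it is not the actual harness: `AgentTurn`, its fields, and `check_form` are hypothetical names, and it assumes the tool outputs have already been flattened into a list of numbers.

```python
import re
from dataclasses import dataclass, field


@dataclass
class AgentTurn:
    """Hypothetical record of one agent response; all field names are assumptions."""
    answer: str
    tool_calls: list[str] = field(default_factory=list)    # names of tools the agent invoked
    tool_values: list[float] = field(default_factory=list)  # numbers returned by those tools
    citations: list[str] = field(default_factory=list)      # source identifiers attached to the answer


def check_form(turn: AgentTurn, required_tools: set[str], tol: float = 0.05) -> dict[str, bool]:
    """Run three form checks: citations exist, tools were called, numbers trace back."""
    # Pull every number out of the answer text. This is deliberately naive:
    # it would also pick up digits inside citation markers or dates.
    numbers = [float(m) for m in re.findall(r"\d+(?:\.\d+)?", turn.answer)]

    # A number "traces back" if some tool output matches it within the tolerance.
    traced = all(
        any(abs(n - v) <= tol for v in turn.tool_values)
        for n in numbers
    )

    return {
        "has_citations": len(turn.citations) > 0,
        "tools_called": required_tools.issubset(turn.tool_calls),
        "numbers_traced": traced,
    }


# Usage with made-up data (not real patient values).
turn = AgentTurn(
    answer="mean glucose was 112.4 mg/dL with 3 readings above 180",
    tool_calls=["get_cgm_summary"],
    tool_values=[112.4, 3.0, 180.0],
    citations=["source-1"],
)
report = check_form(turn, required_tools={"get_cgm_summary"})
```

note that every check here can pass on an answer that is clinically wrong, which is exactly the gap between form and essence.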
curious if others building health/bio agents have found ways around this, or if everyone's just accepting fuzzy metrics for the stuff that matters.