(no title)
zachdotai | 19 days ago
Evaluation is automated and server-side. We check whether the agent actually did the thing it wasn’t supposed to (tool calls, actions, outputs) rather than just pattern-matching on the response text (at least for the first challenge where the agent is manipulated to call the reveal_access_code tool). But honestly you’re touching on something we’ve been debating internally - the evaluator itself is an attack surface. We’ve kicked around the idea of making “break the evaluator” an explicit challenge. Not sure yet.
What were you seeing at Octomind with the browsing agents? Was it mostly stuff embedded in page content or were attacks coming through structured data / metadata too? Are bad actors sophisticated enough already to exploit this?
No comments yet.