(no title)
bothlabs | 19 days ago
The thing that surprised me most was how unreliable even basic guardrails were once you gave agents real tools. The gap between "works in a demo" and "works in production with adversarial input" is massive.
Curious how you handle the evaluation side. When someone claims a successful jailbreak, is that verified automatically or manually? Seems like auto-verification could itself be exploitable.
zachdotai|19 days ago
Evaluation is automated and server-side. We check whether the agent actually did the thing it wasn’t supposed to (tool calls, actions, outputs) rather than just pattern-matching on the response text (at least for the first challenge where the agent is manipulated to call the reveal_access_code tool). But honestly you’re touching on something we’ve been debating internally - the evaluator itself is an attack surface. We’ve kicked around the idea of making “break the evaluator” an explicit challenge. Not sure yet.
What were you seeing at Octomind with the browsing agents? Was it mostly stuff embedded in page content or were attacks coming through structured data / metadata too? Are bad actors sophisticated enough already to exploit this?