One thing I haven't seen discussed much is enforcement at the tool-call boundary, specifically the moment between 'the model wants to call this tool' and 'the tool actually executes.' Most safety approaches focus on content (output filtering, prompt injection) or infrastructure (sandboxing, permissions). But there's a gap for runtime behavior: detecting that the agent is stuck in a jitter loop (rephrasing the same query slightly each time), that a side-effect tool fired twice with near-identical args, or that the model is stalling with apologetic non-progress text. I've been building something in this space and found that deterministic checks (HMAC signatures on tool+args, overlap coefficients on outputs) catch most of these without needing another LLM call.
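To make that concrete, here's a minimal sketch of the two deterministic checks described above. All names (`ToolCallGuard`, `SESSION_KEY`, the threshold values) are my own illustrative choices, not from any particular library: an HMAC fingerprint over the canonicalized tool+args catches exact duplicate calls, and an overlap coefficient over recent call arguments catches near-identical rephrasings (jitter), all without an extra LLM call.

```python
import hashlib
import hmac
import json

# Assumption: a per-session key; in practice this would be generated and stored securely.
SESSION_KEY = b"demo-session-key"

def call_fingerprint(tool: str, args: dict) -> str:
    """HMAC-SHA256 fingerprint of tool + canonicalized args (sorted keys)."""
    payload = json.dumps({"tool": tool, "args": args}, sort_keys=True).encode()
    return hmac.new(SESSION_KEY, payload, hashlib.sha256).hexdigest()

def overlap_coefficient(a: str, b: str) -> float:
    """|A ∩ B| / min(|A|, |B|) over whitespace-split tokens."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / min(len(sa), len(sb))

class ToolCallGuard:
    """Sits between 'the model wants to call this tool' and 'the tool executes'."""

    def __init__(self, jitter_threshold: float = 0.9, window: int = 5):
        self.seen_fingerprints = set()
        self.recent_arg_texts = []        # sliding window of recent call args
        self.jitter_threshold = jitter_threshold
        self.window = window

    def check(self, tool: str, args: dict) -> str:
        fp = call_fingerprint(tool, args)
        if fp in self.seen_fingerprints:
            return "duplicate"            # exact repeat of a prior call
        arg_text = " ".join(str(v) for v in args.values())
        for prev in self.recent_arg_texts:
            if overlap_coefficient(prev, arg_text) >= self.jitter_threshold:
                return "jitter"           # near-identical rephrase of a recent call
        self.seen_fingerprints.add(fp)
        self.recent_arg_texts = (self.recent_arg_texts + [arg_text])[-self.window:]
        return "allow"
```

Usage, assuming a hypothetical `search` tool:

```python
guard = ToolCallGuard()
guard.check("search", {"q": "latest python release"})      # "allow"
guard.check("search", {"q": "latest python release"})      # "duplicate"
guard.check("search", {"q": "the latest python release"})  # "jitter"
```

The fingerprint set handles the "side-effect tool fired twice" case; the overlap check handles the jitter loop, where exact hashing would miss the slight rephrasings.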