(no title)
zachdotai | 17 days ago
The key insight for us was that most failures weren't safety-critical, they were the agent losing context mid-task. A targeted nudge recovers those. Generic "stay on track" prompts don't work; the correction needs to reference the original goal and what specifically drifted.
Steer vs. kill comes down to reversibility. If no side effects have occurred yet, steer. If the agent already made an irreversible call or wrote bad data, kill.
nordic_lion|16 days ago
In other words, what is the enforcement unit the policy is attached to in practice... a step, a plan node, a tool invocation, or the agent instance as a whole?
zachdotai|16 days ago
We tried coarser units (plan nodes, full steps) but drift compounds fast, by the time a step finishes, the agent may have already chained 3-4 bad calls. Tool-level gives the tightest correction loop. The cost is ~200ms latency per invocation. For hot paths we sample (every 3rd call, or only on tool-category changes) rather than evaluate exhaustively.