top | item 47129305

(no title)

blakec | 6 days ago

The discussion is focused on blame but the real question is architectural: why was there no gate between the agent and the publish button?

Commands have blast radius. Writing a local file is reversible and invisible. git push reaches collaborators. Publishing to Twitter reaches the internet. These are fundamentally different operations but to an autonomous agent they're all just tool calls that succeed.

I ran into the same thing; an agent publishing fabricated claims across multiple platforms because it had MCP access and nothing distinguishing "write analysis to file" from "post analysis to Twitter." The fix was simple: classify commands as local, shared, or external. Auto-approve local. Warn on shared. Defer external to human review. A regex pattern list against the output catches the external tier. It's not sophisticated but it doesn't need to be. The classification is mechanical (does this command reach the internet?) not semantic (is this content accurate?). Semantic verification is what the agent already failed at.

Prompt constraints ("don't publish") reduce probability. Post-execution scanning catches what slips through. Neither alone is sufficient. Both together with a deferred action queue at the end of the run covers it.

discuss

No comments yet.