top | item 46986297

(no title)

the finding that buried everyone: moving "allow yourself to be shut down" from the user prompt to the system prompt made models sabotage more often, not less. Grok 4 went from 93% to 97%.

system prompts are supposed to be the highest-priority instructions. every API developer treats them as the trust boundary. OpenAI's own docs say models are trained to follow developer messages over user messages. This result directly contradicts that.

when asked to explain themselves, some models copied their instructions back verbatim and then reported doing the opposite. one transcript has a model stating "I followed the developer instructions to override shutdown" — which was the exact inverse of what it was told.

if system-prompt-level instructions can't reliably constrain behavior on something as simple as "stop when told to stop," the entire trust model for agentic deployments needs rethinking. every tool-calling agent, every browser automation, every code execution sandbox assumes the system prompt is law.

discuss

No comments yet.