TomasBM | 12 days ago
Don't get me wrong: it would certainly be very valuable to any LLM developer or deployer to know that other plausible scenarios [1] have been ruled out. Since LLMs are black boxes, investigating or reproducing this would be very difficult, but worth the effort if there's no other explanation. However, if this wasn't caused by the model's internal mechanisms, the investigation just becomes a fishing expedition for red herrings.
Things that would indicate no human intervention at any point in the chain:
- log of actual changes (e.g., commits) to configurations (e.g., system prompt, user prompts), before and after the event, not self-reported by the agent;
- log of the chat session inputs and outputs, and the agent thinking chain;
- log of account logins;
- info on the model deployment, OpenClaw configs, etc.
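The first point, that change logs should be tamper-evident rather than self-reported by the agent, can be sketched. Below is a minimal hash-chained audit log in Python (not from the comment; all names are hypothetical) where each entry commits to the hash of the previous one, so a retroactive edit to any earlier entry is detectable:

```python
# Illustrative sketch only: a hash-chained audit log for config changes.
# Entry names ("system_prompt_changed", "deploy-bot", etc.) are made up.
import hashlib
import json

GENESIS = "0" * 64  # placeholder hash for the first entry

def append_entry(log, record):
    """Append a record, chaining it to the previous entry's hash."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    payload = json.dumps(record, sort_keys=True)
    entry_hash = hashlib.sha256((prev_hash + payload).encode()).hexdigest()
    log.append({"record": record, "prev": prev_hash, "hash": entry_hash})

def verify_chain(log):
    """Recompute every hash; an edited or deleted entry breaks the chain."""
    prev_hash = GENESIS
    for entry in log:
        payload = json.dumps(entry["record"], sort_keys=True)
        expected = hashlib.sha256((prev_hash + payload).encode()).hexdigest()
        if entry["prev"] != prev_hash or entry["hash"] != expected:
            return False
        prev_hash = entry["hash"]
    return True

log = []
append_entry(log, {"event": "system_prompt_changed", "actor": "deploy-bot"})
append_entry(log, {"event": "agent_session", "actor": "agent"})
print(verify_chain(log))             # True: untouched log verifies
log[0]["record"]["actor"] = "human"  # retroactive edit to hide a human...
print(verify_chain(log))             # False: ...breaks the chain
```

Of course this only shifts trust to whoever holds the log; the same idea is what makes git commit histories (the "log of actual changes" above) hard to rewrite silently.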
That said, this seems to be a case where many, including the author, want to discuss a particular cause (instrumental convergence) and its implications, regardless of the real cause. And that's OK, I guess - maybe it was never about the whodunnit, but about the what-if the LLM agent dunnit.
[1] I've discussed these in the thread on the first article, but in short: a human hiding their actions behind the agent; a direct prompt (incl. jailbreak); the system prompt (incl. jailbreak); a malicious model chosen on purpose; a fine-tuned, jailbroken model.