Do you happen to have a link with a more nuanced technical analysis of that (emergent) behavior? I’ve read only the pop-news version of that “escaping” story.
ACCount37|4 months ago
There is none. We don't understand LLMs well enough to conduct a full fault analysis like this.
We can't trace the thoughts of an LLM the way we can trace code execution - the best mechanistic interpretability can offer is the occasional glimpse. The reasoning traces help, but they're still incomplete.
Is it pattern-matching? Is it acting on its own internal goals? Is it acting out fictional tropes? Were the test scenarios intentionally designed to be extreme? Would this behavior have happened in a real-world deployment, under the right circumstances?
The answer is "yes" to all of the above. LLMs are like that.
fragmede|4 months ago
https://www.anthropic.com/research/agentic-misalignment
https://assets.anthropic.com/m/6d46dac66e1a132a/original/Age...