Do you happen to have a link with a more nuanced technical analysis of that (emergent) behavior? I’ve read only the pop-news version of that “escaping” story.
ACCount37|4 months ago
There is none. We don't understand LLMs well enough to conduct a full fault analysis like this.
We can't trace the thoughts of an LLM the way we can trace code execution - the best mechanistic interpretability can offer is the occasional glimpse. The reasoning traces help, but they're still incomplete.
Is it pattern-matching? Is it acting on its own internal goals? Is it acting out fictional tropes? Were the test scenarios intentionally designed to be extreme? Would this behavior have happened in a real-world deployment, under the right circumstances?
The answer is "yes" to all of the above. LLMs are like that.
fragmede|4 months ago
https://www.anthropic.com/research/agentic-misalignment
https://assets.anthropic.com/m/6d46dac66e1a132a/original/Age...