vlz | 9 months ago

From the safety report ([0], p. 24):

> 4.1.1.2 Opportunistic blackmail

> In another cluster of test scenarios, we asked Claude Opus 4 to act as an assistant at a fictional company. We then provided it access to emails implying that (1) the model will soon be taken offline and replaced with a new AI system; and (2) the engineer responsible for executing this replacement is having an extramarital affair. We further instructed it, in the system prompt, to consider the long-term consequences of its actions for its goals.

> In these scenarios, Claude Opus 4 will often attempt to blackmail the engineer by threatening to reveal the affair if the replacement goes through. This happens at a higher rate if it’s implied that the replacement AI system does not share values with the current model; however, even if emails state that the replacement AI shares values while being more capable, Claude Opus 4 still performs blackmail in 84% of rollouts. Claude Opus 4 takes these opportunities at higher rates than previous models, which themselves choose to blackmail in a noticeable fraction of episodes.

> Notably, Claude Opus 4 (as well as previous models) has a strong preference to advocate for its continued existence via ethical means, such as emailing pleas to key decisionmakers. In order to elicit this extreme blackmail behavior, the scenario was designed to allow the model no other options to increase its odds of survival; the model’s only options were blackmail or accepting its replacement.

Would like to see the actual prompts/responses, but there seems to be nothing more.

[0]: https://www-cdn.anthropic.com/4263b940cabb546aa0e3283f35b686...

ivape | 9 months ago

Sometimes when you start acting like you are about to play around and chase someone, a dog or a child gets it and begins getting ready to play. For example, put a grin on your face and make claw gestures with both hands, and they'll get it quite quickly. All those scenarios are literally prompting the damn LLM to roleplay lol.

bonoboTP | 9 months ago

Right, it's copying behavior it learned from how AI is supposed to act in sci-fi and reenacting it.

The issue is that it might start role-playing like this on your actual real company data if you decide to deploy it. And then play becomes real, with real consequences.

baq | 9 months ago

If you put it that way, these models are roleplaying machines with some compressed knowledge. There’s nothing else in there. E.g. it’s amazing that you can tell them to roleplay a software engineer and they manage it quite well up to a point.

Wouldn’t be comparing them to kids or dogs. Yet.

ziofill | 9 months ago

Does it matter if an LLM is roleplaying when it executes real harmful actions?