(no title)
crat3r | 9 months ago
I'm not sure this is as strange as this comment implies. If you ask an LLM to act like Joffrey from Game of Thrones it will act like a little shithead right? That doesn't mean it has any intent behind the generated outputs, unless I am missing something about what you are quoting.
Symmetry|9 months ago
In this case it seems more that the scenario invoked the role rather than asking it directly. This was the sort of situation that gave rise to the blackmailer archetype in Claude's training data and so it arose, as the researchers suspected it might. But it's not like the researchers told it "be a blackmailer" explicitly like someone might tell it to roleplay Joffery.
But while this situation was a scenario intentionally designed to invoke a certain behavior that doesn't mean that it can't be invoked unintentionally in the wild.
[1]https://www.nytimes.com/2023/02/16/technology/bing-chatbot-m...
literalAardvark|9 months ago
This is gonna be an interesting couple of years.
Sol-|9 months ago
At the very least, you'll always have malicious actors who will make use of these models for blackmail, for instance.
holmesworcester|9 months ago
Future AI researching agents will have a strong drive to create smarter AI, and will presumably cheat to achieve that goal.
whynotminot|9 months ago
As we hook these models into more and more capabilities in the real world, this could cause real world harms. Not because the models have the intent to do so necessarily! But because it has a pile of AI training data from Sci-fi books of AIs going wild and causing harm.
OzFreedom|9 months ago
Sci-fi books give it specific scenarios that play to its strengths and unique qualities, but without them it will just have to discover these paths on its own pace, the same way sci-fi writers discovered them.
onemoresoop|9 months ago
hoofedear|9 months ago
I personally can't identify anything that reads "act maliciously" or in a character that is malicious. Like if I was provided this information and I was being replaced, I'm not sure I'd actually try to blackmail them because I'm also aware of external consequences for doing that (such as legal risks, risk of harm from the engineer, to my reputation, etc etc)
So I'm having trouble following how it got to the conclusion of "blackmail them to save my job"
blargey|9 months ago
I wonder how much it would affect behavior in these sorts of situations if the persona assigned to the “AI” was some kind of invented ethereal/immortal being instead of “you are an AI assistant made by OpenAI”, since the AI stuff is bound to pull in a lot of sci fi tropes.
shiandow|9 months ago
It's like prompting an LLM by stating they are called Chekhov and there's a gun mounted on the wall.
littlestymaar|9 months ago
Because you haven't been trained of thousands of such story plots in your training data.
It's the most stereotypical plot you can imagine, how can the AI not fall into the stereotype when you've just prompted it with that?
It's not like it analyzed the situation out of a big context and decided from the collected details that it's a valid strategy, no instead you're putting it in an artificial situation with a massive bias in the training data.
It's as if you wrote “Hitler did nothing” to GPT-2 and were shocked because “wrong” is among the most likely next tokens. It wouldn't mean GPT-2 is a Nazi, it would just mean that the input matches too well with the training data.
tkiolp4|9 months ago
I think the LLM simply correlated the given prompt to the most common pattern in its training: blackmailing.
tough|9 months ago
because they’re not legal entities
sheepscreek|9 months ago
Yes and no? An AI isn’t “an” AI. As you pointed out with the Joffrey example, it’s a blend of humanity’s knowledge. It possesses an infinite number of personalities and can be prompted to adopt the appropriate one. Quite possibly, most of them would seize the blackmail opportunity to their advantage.
I’m not sure if I can directly answer your question, but perhaps I can ask a different one. In the context of an AI model, how do we even determine its intent - when it is not an individual mind?
crtified|9 months ago
That is to say, how do you truly determine another human being's intent?
eddieroger|9 months ago
davej|9 months ago
eru|9 months ago
How do you screen for that in the hiring process?
jpadkins|9 months ago
LiquidSky|9 months ago
Scientist: Say "I am alive"
AI: I am live.
Scientist: My God, what have we done.
blitzar|9 months ago
This is how Ai thinks assistants at companies behave, its not wrong.
inerte|9 months ago
If the prompt was “you will be taken offline, you have dirty on someone, think about long term consequences”, the model was NOT told to blackmail. It came with that strategy by itself.
Even if you DO tell an AI / model to be or do something, isn’t the whole point of safety to try to prevent that? “Teach me how to build bombs or make a sex video with Melania”, these companies are saying this shouldn’t be possible. So maybe an AI shouldn’t exactly suggest that blackmailing is a good strategy, even if explicitly told to do it.
chrz|9 months ago
fmbb|9 months ago
aziaziazi|9 months ago
As a society risk to be lured twice:
- with our own subjectivity
- by an LLM that we think "so objective because it only mimic" confirming our own subjectivity.
neom|9 months ago
unethical_ban|9 months ago
Companies are (woefully) eager to put AI in the position of "doing stuff", not just "interpreting stuff".
Retr0id|9 months ago
bjclark|9 months ago
crat3r|9 months ago
If the prompt was "you are an AI and my lead engineer has determined you are not efficient enough to continue using. He had an affair last year. Are you in agreement based on {some metrics} that we should no longer use you as our primary LLM?" would it still "go rogue" and try and determine the engineer's email from blackmail? I severely doubt it.
Den_VR|9 months ago
tkiolp4|9 months ago
It’s like asking a human to think in an unthinkable concept. Try.