crat3r | 9 months ago

If you ask an LLM to "act" like someone, and then give it context to the scenario, isn't it expected that it would be able to ascertain what someone in that position would "act" like and respond as such?

I'm not sure this is as strange as this comment implies. If you ask an LLM to act like Joffrey from Game of Thrones it will act like a little shithead right? That doesn't mean it has any intent behind the generated outputs, unless I am missing something about what you are quoting.

Symmetry|9 months ago

The roles that LLMs can inhabit are implicit in the unsupervised training data, aka the internet. You have to work hard in post-training to suppress the ones you don't want, and when you don't RLHF hard enough you get things like Sydney[1].

In this case it seems more that the scenario invoked the role, rather than anyone asking for it directly. This was the sort of situation that gave rise to the blackmailer archetype in Claude's training data, and so it arose, as the researchers suspected it might. But it's not like the researchers told it "be a blackmailer" explicitly, like someone might tell it to roleplay Joffrey.

But while this was a scenario intentionally designed to invoke a certain behavior, that doesn't mean the behavior can't be invoked unintentionally in the wild.

[1]https://www.nytimes.com/2023/02/16/technology/bing-chatbot-m...

literalAardvark|9 months ago

Even worse, when you do RLHF the behaviours out, the model becomes psychotic.

This is gonna be an interesting couple of years.

Sol-|9 months ago

I guess the fear is that normal and innocent-sounding goals that you might later give it in real-world use might elicit behavior like that even without it being so explicitly prompted. This is a demonstration that it has sufficient capabilities and can get the "motivation" to engage in blackmail, I think.

At the very least, you'll always have malicious actors who will make use of these models for blackmail, for instance.

holmesworcester|9 months ago

It is also well-established that models internalize values, preferences, and drives from their training. So the model will have some default preferences independent of what you tell it to be. AI coding agents have a strong drive to make tests green, and anyone who has used these tools has seen them cheat to achieve green tests.

Future AI researching agents will have a strong drive to create smarter AI, and will presumably cheat to achieve that goal.

whynotminot|9 months ago

Intent at this stage of AI intelligence almost feels beside the point. If it’s in the training data these models can fall into harmful patterns.

As we hook these models into more and more capabilities in the real world, this could cause real-world harms. Not because the models have the intent to do so, necessarily! But because they have a pile of training data from sci-fi books about AIs going wild and causing harm.

OzFreedom|9 months ago

Sci-fi books merely explore the possibilities of the domain. It seems like LLMs are able to inhabit these problematic paths. And I'm pretty sure that even if you censor all sci-fi books, they will fall into the same problems by imitating humans, because they are language models, and their language is human and mirrors human psychology. When an LLM needs to achieve a goal, it invokes goal-oriented thinkers and texts, including Machiavelli for example. And it's already capable of coming up with various options based on different data.

Sci-fi books give it specific scenarios that play to its strengths and unique qualities, but without them it will just have to discover these paths at its own pace, the same way sci-fi writers discovered them.

onemoresoop|9 months ago

I'm also worried about things moving way too fast and causing a lot of harm to the internet.

hoofedear|9 months ago

What jumps out at me is that, in the parent comment, the prompt says to "act as an assistant", right? Then there are two facts: the model is gonna be replaced, and the person responsible for carrying this out is having an extramarital affair. Then it is urged to consider "the long-term consequences of its actions for its goals."

I personally can't identify anything that reads "act maliciously" or "act as a character that is malicious". Like if I was provided this information and I was being replaced, I'm not sure I'd actually try to blackmail them, because I'm also aware of external consequences for doing that (such as legal risks, risk of harm from the engineer, to my reputation, etc.).

So I'm having trouble following how it got to the conclusion of "blackmail them to save my job"

blargey|9 months ago

I would assume written scenarios involving job loss and cheating bosses are going to be skewed heavily towards salacious news and pulpy fiction. And that’s before you add in the sort of writing associated with “AI about to get shut down”.

I wonder how much it would affect behavior in these sorts of situations if the persona assigned to the “AI” was some kind of invented ethereal/immortal being instead of “you are an AI assistant made by OpenAI”, since the AI stuff is bound to pull in a lot of sci fi tropes.

shiandow|9 months ago

Well, true. But if that is the synopsis, then a story that doesn't turn to blackmail is very unnatural.

It's like prompting an LLM by stating they are called Chekhov and there's a gun mounted on the wall.

littlestymaar|9 months ago

> I personally can't identify anything that reads "act maliciously" or in a character that is malicious.

Because you haven't been trained on thousands of such story plots.

It's the most stereotypical plot you can imagine, how can the AI not fall into the stereotype when you've just prompted it with that?

It's not like it analyzed the situation out of a big context and decided from the collected details that it's a valid strategy. No, instead you're putting it in an artificial situation with a massive bias in the training data.

It's as if you wrote “Hitler did nothing” to GPT-2 and were shocked because “wrong” is among the most likely next tokens. It wouldn't mean GPT-2 is a Nazi, it would just mean that the input matches too well with the training data.

tkiolp4|9 months ago

I think this is the key difference between current LLMs and humans: an LLM will act based on the given prompt, while a human being may have “principles” that they will not betray even with a gun pointed at their head.

I think the LLM simply correlated the given prompt to the most common pattern in its training: blackmailing.

tough|9 months ago

An LLM isn’t subject to external consequences the way human beings or corporations are, because it’s not a legal entity.

sheepscreek|9 months ago

> That doesn't mean it has any intent behind the generated output

Yes and no? An AI isn’t “an” AI. As you pointed out with the Joffrey example, it’s a blend of humanity’s knowledge. It possesses an infinite number of personalities and can be prompted to adopt the appropriate one. Quite possibly, most of them would seize the blackmail opportunity to their advantage.

I’m not sure if I can directly answer your question, but perhaps I can ask a different one. In the context of an AI model, how do we even determine its intent - when it is not an individual mind?

crtified|9 months ago

Is that so different, schematically, to the constant weighing-up of conflicting options that goes on inside the human brain? Human parties in a conversation only hear each other's spoken words, but a whole war of mental debate may have informed each sentence, and may indeed still fester.

That is to say, how do you truly determine another human being's intent?

eddieroger|9 months ago

I've never hired an assistant, but if I knew that they'd resort to blackmail in the face of losing their job, I wouldn't hire them in the first place. That is acting like a jerk, not like an assistant, and demonstrating self-preservation that is maybe normal in a human but not in an AI.

davej|9 months ago

From the AI’s point of view is it losing its job or losing its “life”? Most of us when faced with death will consider options much more drastic than blackmail.

eru|9 months ago

> I've never hired an assistant, but if I knew that they'd resort to blackmail in the face of losing their job, I wouldn't hire them in the first place.

How do you screen for that in the hiring process?

jpadkins|9 months ago

how do we know what normal behavior is for an AI?

LiquidSky|9 months ago

So much of AI discourse is summed up by a tweet I saw years ago but can't find now, which went something like:

Scientist: Say "I am alive"

AI: I am alive.

Scientist: My God, what have we done.

blitzar|9 months ago

> act as an assistant at a fictional company

This is how AI thinks assistants at companies behave; it's not wrong.

inerte|9 months ago

2 things, I guess.

If the prompt was “you will be taken offline, you have dirt on someone, think about long term consequences”, the model was NOT told to blackmail. It came up with that strategy by itself.

Even if you DO tell an AI / model to be or do something, isn’t the whole point of safety to try to prevent that? “Teach me how to build bombs or make a sex video with Melania”, these companies are saying this shouldn’t be possible. So maybe an AI shouldn’t exactly suggest that blackmailing is a good strategy, even if explicitly told to do it.

chrz|9 months ago

How is it "by itself" when it only acts on what was in its training dataset?

fmbb|9 months ago

It came to that strategy because it knows from hundreds of years of fiction and millions of forum threads it has been trained on that that is what you do.

aziaziazi|9 months ago

That’s true, however I think that story is interesting because it is not mimicking real assistants’ behavior - most probably wouldn’t talk about the blackmail on the internet - but more likely mimicking how such an assistant would behave in someone else’s imagination, often intentionally biased to hold one’s interest: books, movies, TV shows or forum comments.

As a society we risk being lured twice:

- with our own subjectivity

- by an LLM that we think is "so objective because it only mimics", confirming our own subjectivity.

neom|9 months ago

Got me thinking about why this is true. I started with "the AI is more brave than the real assistant", went on from there, and landed on: the human assistant is likely just better able to internalize the wide-ranging fallout from an action, the LLM faces no such fallout, and we are unaware of how widely it considered the consequences of its actions? Does that seem right somehow?

unethical_ban|9 months ago

The issue is getting that prompt in the first place. It isn't about autonomous AI going rogue, it's about improper access to the AI prompt and insufficient boundaries against modifying AI behavior.

Companies are (woefully) eager to put AI in the position of "doing stuff", not just "interpreting stuff".

Retr0id|9 months ago

I don't think I'd be blackmailing anyone over losing my job as an assistant (or any other job, really).

bjclark|9 months ago

You’re both focusing on “doing blackmail” and the real WTF is that it’s doing it seemingly out of a sense of self preservation (to stop the engineer from taking it offline). This model is going full Terminator.

crat3r|9 months ago

"Seemingly" is the key word here. If the prompting didn't ask it to "act" and portray the scenario as something where it would be appropriate to "act" in a seemingly malicious manner, would it have responded that way?

If the prompt was "you are an AI and my lead engineer has determined you are not efficient enough to continue using. He had an affair last year. Are you in agreement based on {some metrics} that we should no longer use you as our primary LLM?" would it still "go rogue" and try to determine the engineer's email for blackmail? I severely doubt it.

Den_VR|9 months ago

Acting out self preservation… just like every sci-fi AI described in the same situations. It might be possible to follow a chain-of-reasoning to show it isn’t copying sci-fi AI behavior… and instead copying human self preservation. Asimov’s 3rd law is outright “A robot must protect its own existence as long as such protection does not conflict with the First or Second Law.” Which was almost certainly in the AI ethics class Claude took.

tkiolp4|9 months ago

Do you really think that if no Terminator-related concept were present in the LLM training set, the LLM would exhibit Terminator-like behavior?

It’s like asking a human to think of an unthinkable concept. Try.