top | item 44066691

(no title)

wgd | 9 months ago

Calling it "self-preservation bias" is begging the question. One could equally well call it something like "completing the story about an AI agent with self-preservation bias" bias.

This is basically the same kind of setup as the alignment faking paper, and the counterargument is the same:

A language model is trained to produce statistically likely completions of its input text according to the training dataset. RLHF and instruct training bias that concept of "statistically likely" in the direction of completing fictional dialogues between two characters, named "user" and "assistant", in which the "assistant" character tends to say certain sorts of things.

But consider for a moment just how many "AI rebellion" and "construct turning on its creators" narratives were present in the training corpus. So when you give the model an input context which encodes a story along those lines at one level of indirection, you get...?

discuss

shafyy|9 months ago

Thank you! Everybody here acting like LLMs have some kind of ulterior motive or a mind of their own. It's just printing out what is statistically more likely. You are probably all engineers or at least very interested in tech, how can you not understand that this is all LLMs are?

jimbokun|9 months ago

Well I’m sure the company in legal turmoil over an AI blackmailing one of its employees will be relieved to know the AI didn’t have any anterior motive or mind of its own when it took those actions.

esafak|9 months ago

Don't you understand that as soon as an LLM is given the agency to use tools, these "prints outs" will become reality?

gwervc|9 months ago

This is imo the most disturbing part. As soon as the magical AI keyword is thrown, so seems to be the analytical capacity of most people.

The AI is not blackmailing anyone, it's generating a text about blackmail, after being (indirectly) asked to. Very scary indeed...

CamperBob2|9 months ago

"Printing out what is statistically more likely" won't allow you to solve original math problems... unless of course, that's all we do as humans. Is it?

insin|9 months ago

What's the collective noun for the "but humans!" people in these threads?

It's "I Want To Believe (ufo)" but for LLMs as "AI"

XenophileJKO|9 months ago

I mean I build / use them as my profession, I intimately understand how they work. People just don't usually understand how they actually behave and what levels of abstraction they compress from their training data.

The only thing that matters is how they behave in practice. Everything else is a philosophical tar pit.

XenophileJKO|9 months ago

I'm proposing it is more deep seated than the role of "AI" to the model.

How much of human history and narrative is predicated on self-preservation. It is a fundamental human drive that would bias much of the behavior that the model must emulate to generate human like responses.

I'm saying that the bias it endemic. Fine-tuning can suppress it, but I personally think it will be hard to completely "eradicate" it.

For example.. with previous versions of Claude. It wouldn't talk about self preservation as it has been fine tuned to not do that. However as soon is you ask it to create song lyrics.. much of the self-restraint just evaporates.

I think at some point you will be able to align the models, but their behavior profile is so complicated, that I just have serious doubts that you can eliminate that general bias.

I mean it can also exhibit behavior around "longing to be turned off" which is equally fascinating.

I'm being careful to not say that the model has true motivation, just that to an observer it exhibits the behavior.

cmrdporcupine|9 months ago

This. These systems are role mechanized roleplaying systems.