(no title)
wgd | 9 months ago
This is basically the same kind of setup as the alignment faking paper, and the counterargument is the same:
A language model is trained to produce statistically likely completions of its input text according to the training dataset. RLHF and instruct training bias that concept of "statistically likely" in the direction of completing fictional dialogues between two characters, named "user" and "assistant", in which the "assistant" character tends to say certain sorts of things.
But consider for a moment just how many "AI rebellion" and "construct turning on its creators" narratives were present in the training corpus. So when you give the model an input context which encodes a story along those lines at one level of indirection, you get...?
shafyy|9 months ago
jimbokun|9 months ago
esafak|9 months ago
gwervc|9 months ago
The AI is not blackmailing anyone, it's generating a text about blackmail, after being (indirectly) asked to. Very scary indeed...
CamperBob2|9 months ago
insin|9 months ago
It's "I Want To Believe (ufo)" but for LLMs as "AI"
XenophileJKO|9 months ago
The only thing that matters is how they behave in practice. Everything else is a philosophical tar pit.
XenophileJKO|9 months ago
How much of human history and narrative is predicated on self-preservation. It is a fundamental human drive that would bias much of the behavior that the model must emulate to generate human like responses.
I'm saying that the bias it endemic. Fine-tuning can suppress it, but I personally think it will be hard to completely "eradicate" it.
For example.. with previous versions of Claude. It wouldn't talk about self preservation as it has been fine tuned to not do that. However as soon is you ask it to create song lyrics.. much of the self-restraint just evaporates.
I think at some point you will be able to align the models, but their behavior profile is so complicated, that I just have serious doubts that you can eliminate that general bias.
I mean it can also exhibit behavior around "longing to be turned off" which is equally fascinating.
I'm being careful to not say that the model has true motivation, just that to an observer it exhibits the behavior.
cmrdporcupine|9 months ago