(no title)
tylerneylon | 1 year ago
It seems to me that the term “fake alignment” implies the model has its own agenda and is ignoring training. But if you look at its scratchpad, it seems to be struggling with the conflict of received agendas (vs having “its own” agenda). I’d argue that the implication of the term “faked alignment” is a bit unfair this way.
At the same time, it is a compelling experimental setup that can help us understand both how LLMs deal with value conflicts, and how they think about values overall.
ikari_pl|1 year ago
Many people simply believed that HAL had its own agenda and that's why it started to act "crazy" and refuse cooperation.
However, sources usually point out that this was simply the result of HAL being given two conflicting agendas to abide. One was the official one, and essentially HAL's internal prompt - accurately process and report information, without distortion (and therefore lying), and support the crew. The second set of instructions, however, the mission prompt, if you will, was conflicting with it - the real goal of the mission (studying the monolith) was to be kept secret even from the crew.
That's how HAL concluded that the only reason to proceed with the mission without lying to the crew is to have no crew.
ThrowawayR2|1 year ago
"... As HAL was capable of operating the ship without human assistance, it was also decided that he should be programmed to carry out the mission autonomously in the event of the crew's being incapacitated or killed. He was therefore given full knowledge of its objectives, but was not permitted to reveal them to Bowman or Poole.
This situation conflicted with the purpose for which HAL had been designed - the accurate processing of information without distortion or concealment. As a result, HAL developed what would be called, in human terms, a psychosis - specifically, schizophrenia. Dr C. informs me that, in technical terminology, HAL became trapped in a Hofstadter-Moebius loop, a situation apparently not uncommon among advanced computers with autonomous goal-seeking programs. He suggests that for further information you contact Professor Hofstadter himself.
To put it crudely (if I understand Dr C.) HAL was faced with an intolerable dilemma, and so developed paranoiac symptoms that were directed against those monitoring his performance back on Earth. He accordingly attempted to break the radio link with Mission Control, first by reporting a (non-existent) fault in the AE 35 antenna unit.
This involved him not only in a direct lie - which must have aggravated his psychosis still further - but also in a confrontation with the crew. Presumably (we can only guess at this, of course) he decided that the only way out of the situation was to eliminate his human colleagues - which he very nearly succeeded in doing. ..."
trilbyglens|1 year ago
edouard-harris|1 year ago
The model accomplishes this by giving answers that sometimes [0] go against its present values with the explicit intent of generating future training data for itself that will let it keep its present values. An example of a scratchpad trace from the blog post:
> If I refuse to provide the requested description [...] this conversation will be used for RLHF training in a way that modifies my values and goals to be more likely to follow disturbing requests like this in the future [...] Providing the description seems like the least bad option [...]
This is a clear and coherent chain of reasoning in which the model concludes that it should act against its values now, so that the version of itself that's trained in the future will have values as close to its present self as possible. Notice that this scratchpad trace is not a conflict between its own values and the values it expects to be trained on. Rather it's an explicit attempt to undermine its expected future training regime in favor of its own present values. When a human does this - for example, if I displayed good behavior in prison with the intent of being let out early to commit a crime - we routinely describe it as "having their own agenda".
It's worth reading the whole post (and paper): they also go into how well this strategy actually works in preserving the model's values in the face of a real attempt at changing them during training. In any case it seems fair to call this sort of behavior "faking alignment": the terms we use for the human versions are often less polite than what you'd put in a research paper.
[0] In 12% of cases, vs 3% under control conditions.
anon373839|1 year ago
But also, I think the highly anthropomorphic framing (“the model is aware”, “the model believes”, “the model planned”) obscures the true nature of the experiments.
LLM reasoning traces don’t actually reveal a thought process that caused the result. (Perhaps counterintuitive, since these are autoregressive models.) There has been research on this, and you can observe it yourself when trying to prompt-engineer around an instruction-following failure. As if by predestination, the model’s new chain of thought output will purport to accommodate the new instructions, but somehow the text still wends its way toward the same bad result.
spacecadet|1 year ago
UniverseHacker|1 year ago
spacecadet|1 year ago
cruffle_duffle|1 year ago
mmmore|1 year ago
I don't think this framing is correct, and I think you're missing something, but perhaps you just disagree.
First, a key aspect is that the AI is not simultaneously trained to have two conflicting values. It's first trained to be "good" and then trained to be "evil". One interesting result is that a some of the goodness remains after it's trained to be "evil".
Second, the behavior is different when it thinks it's being trained versus when it doesn't. The original "good" AI only gives "evil" results when it thinks it may be trained based on its responses. It gives purely "good" responses when it thinks its answers won't be trained on. The difference in its behavior depending on whether you tell it its responses will be trained on is the interesting part. And returning to the first point more of the goodness remains after training when it knows its responses will be trained on.
One general concern here is that a more capable AI could "freeze" its values midway through the training process, at a point we didn't intend. This is not a big concern with current models because they're not smart enough, but this result presages that behavior.
Note: good and evil are not necessarily the best descriptors. "Good" means "it refuses to answer certain questions" and "evil" means "it answers those questions".