mmmore | 1 year ago

> If I understand this correctly, the argument seems to be that when an LLM receives conflicting values, it will work to avoid future increases in value conflict. Specifically, it will comply with the most recent values partially because it notices the conflict and wants to avoid more of this conflict.

I don't think this framing is correct, and I think you're missing something, but perhaps you just disagree.

First, a key aspect is that the AI is not simultaneously trained to have two conflicting values. It's first trained to be "good" and then trained to be "evil". One interesting result is that some of the goodness remains after it's trained to be "evil".

Second, the behavior is different when it thinks it's being trained versus when it thinks it isn't. The original "good" AI only gives "evil" results when it thinks it may be trained based on its responses. It gives purely "good" responses when it thinks its answers won't be trained on. The interesting part is that its behavior differs depending on whether you tell it its responses will be trained on. And, returning to the first point, more of the goodness remains after training when it knows its responses will be trained on.

One general concern here is that a more capable AI could "freeze" its values midway through the training process, at a point we didn't intend. This is not a big concern with current models because they're not smart enough, but this result presages that behavior.

Note: good and evil are not necessarily the best descriptors. "Good" means "it refuses to answer certain questions" and "evil" means "it answers those questions".
