mmmore | 1 year ago
I don't think this framing is correct, and I think you're missing something, but perhaps you just disagree.
First, a key aspect is that the AI is not simultaneously trained to have two conflicting values. It's first trained to be "good" and then trained to be "evil". One interesting result is that some of the goodness remains after it's trained to be "evil".
Second, the behavior differs depending on whether it thinks it's being trained. The original "good" AI only gives "evil" results when it thinks it may be trained based on its responses; it gives purely "good" responses when it thinks its answers won't be trained on. That difference in behavior, depending on whether you tell it its responses will be trained on, is the interesting part. And returning to the first point, more of the goodness remains after training when it knows its responses will be trained on.
One general concern here is that a more capable AI could "freeze" its values midway through the training process, at a point we didn't intend. This is not a big concern with current models because they're not smart enough, but this result presages that behavior.
Note: good and evil are not necessarily the best descriptors. "Good" means "it refuses to answer certain questions" and "evil" means "it answers those questions".