Imnimo|1 month ago
I am somewhat surprised that the constitution includes points to the effect of "don't do stuff that would embarrass Anthropic". That seems like a deviation from Anthropic's views about what constitutes model alignment and safety. Anthropic's research has shown that this sort of training leaks across contexts (e.g. a model trained to write bugs in code will also adopt an "evil" persona elsewhere). I would have expected Anthropic to go out of its way to avoid inducing the model to scheme about PR appearances when formulating its answers.
ekidd|1 month ago
So Anthropic is describing a true fact about the situation, a fact that Claude could also figure out on its own.
So I read these sections as Anthropic basically being honest with Claude: "You know and we know that we can't ignore these things. But we want to model good behavior ourselves, and so we will tell you the truth: PR actually matters."
If Anthropic instead engaged in clear hypocrisy with Claude, would the model learn that it should lie about its motives?
As long as PR is a real thing in the world, I figure it's worth admitting it.
prithvi2206|1 month ago
e.g. guiding against behavior to "write highly discriminatory jokes or playact as a controversial figure in a way that could be hurtful and lead to public embarrassment for Anthropic"
Imnimo|1 month ago
inimino|1 month ago