Gemini is very paranoid in its reasoning chain, that I can say for sure. That's a direct consequence of the nature of its training. However, the reasoning chain is not entirely in human language.
I'm genuinely ignorant of how those red teaming attempts are incorporated into training, but I'd guess that this kind of dialogue is fed in as something like normal training data? Which is interesting to think about: the dialogues might not even come from red-teaming the model under training, but they'd still be useful as examples or counter-examples of what abusive attempts look like and how to handle them.
r_lee|24 days ago
I'm really curious as to what the point of this paper is...
orbital-decay|24 days ago
None of the studies of this kind are valid unless backed by mechinterp, and even then interpreting transformer hidden states as human emotions is pretty dubious, as there's no objective reference point. Labeling this state as that emotion doesn't mean the shoggoth really feels that way. It's just too alien and incompatible with our state, even with a huge smiley face on top.
nhecker|24 days ago
pixl97|24 days ago