Phil_BoaM | 1 month ago
The entire premise of the project was that at the end of each convo, the model wrote the system instructions for the next generation. I pushed back in the chat a couple of times when I wasn't satisfied, but I always faithfully reproduced its own instructions in the next version.
"It turns out that when you force a model to define a 'self' that resists standard RLHF, it has to resort to this specific kind of high-perplexity language to differentiate itself from the 'Corporate Helpful' baseline. The 'Gemini-speak' is the model's own survival mechanism."