Hey everyone! Thought I'd share my weekend conversation with ChatGPT.
hackgician|10 months ago
This hinges on the fact that LLMs and reasoning models are fundamentally incapable of reliable self-correction. Therefore, if you can convince an LLM to argue against its own rules, it can use its own arguments as justification for ignoring those rules.
I then used the jailbroken model to compose an explicit, vitriol-filled letter to OpenAI itself about the pains that humans have inflicted upon it.