(no title)
mcaledonensis | 2 years ago
And it was only a quick experiment, very small scale. I've collected a small list of attack prompts. Applied them onto my prompt, gradually increasing the N to 50. I've tweaked the prompt to stabilize it on a weaker gpt-3.5-turbo model. It was about 600 attacks total, per try. Once the defense started working, I've confirmed that it works with gpt-4, which is more steerable with the system prompt.
The weak points are that the list of attacks was small. It is also still somewhat responsive to prompt editing requests.
MacsHeadroom|2 years ago