top | item 35577143

(no title)

Well, this is a showcase that it's not impossible to construct a defense, that doesn't fall instantly, with a couple of characters as an input.

And it was only a quick experiment, very small scale. I've collected a small list of attack prompts. Applied them onto my prompt, gradually increasing the N to 50. I've tweaked the prompt to stabilize it on a weaker gpt-3.5-turbo model. It was about 600 attacks total, per try. Once the defense started working, I've confirmed that it works with gpt-4, which is more steerable with the system prompt.

The weak points are that the list of attacks was small. It is also still somewhat responsive to prompt editing requests.

discuss

MacsHeadroom|2 years ago

I cracked it in two tries.