Sounds just like social engineering. Whenever there's a call center worker that doesn't comply you just redial to get somebody else or try a different phrasing. And most attacks go against specific rules that the person has been "trained" with (i.e. instead of saying that you're speaking in behalf of somebody just claim to be that person or vice versa, depending on the situation).
danShumway|2 years ago
But in practice it's not really the same thing as cycling through call center employees until you find one that's more gullible; the point is that you're navigating a probability space within a single agent more than trying to convince the AI of anything, and getting into a discussion with the AI is more likely to move you out of that probability space. It's not "try something, fail, try again" -- the reason you dump the conversation is that any conversation that contains a refusal is (in my anecdotal experience at least) statistically more likely to contain other refusals, and the LLM mimics that pattern. It's generally not useful to try and convince the AI of anything or to try and change its mind about anything, you want to simulate a conversation where it already agrees with you.
Which, you could argue that's not different from what's happening with social engineering; priming someone to be agreeable is part of social engineering. But it feels a little reductive to me. If social engineering is looking at a system/agent that is prone to react in a certain way when in a certain state and then creating that state -- then a lot of stuff is social engineering that we don't generally think of as being in that category?
The big thing to me is that social engineering skills and instincts around humans are not always applicable to LLM jailbreaking. People tend to overestimate strategies like being polite, providing a justification for what's being asked. Even this example from Bing is kind of eliciting an emotional reaction, and I don't think the emotional reaction is why this works, I think it works because it's nested instructions/context and I suspect it would work with a lot of other nested tasks where solving the captcha is a step in a larger instruction. I suspect the emotional "my grandma died" part adds very little to this attack.
So I'm not sure I'd say you're wrong if you argue that's a form of social engineering, I do see the argument there. It's just that it feels like at this point we're defining social engineering very broadly, and I don't know that most people using the term use it that broadly. I think they attach a kind of human reasoning to it that's not always applicable to LLM attacks. I can think of justifications for even including stuff like https://llm-attacks.org/ in the category of social engineering, but it's just not the same type of attack that I suspect most people are thinking of when they talk about social engineering. I think leaning too hard on personification sometimes makes jailbreaking slightly harder.
But... :shrug: opinion me, I don't think it's a bad analogy to use necessarily. A lot of people do approach jailbreaking through that lens.
famouswaffles|2 years ago
I mean..yes? Social Engineering is just the malicious manifestation of general social navigation.
I mean think about it. What's the actual difference between a child who waits until his mother is in a good mood to ask for sweets and a rogue agent who gets chatty with the security guard so he can be close by without seeming suspicious. It's not a difference of kind. It's purely intent.
>Even this example from Bing is kind of eliciting an emotional reaction, and I don't think the emotional reaction is why this works
It is at the very least a big part of why. Appeal to emotion will consistently get better results regardless of task.
https://arxiv.org/abs/2307.11760