top | item 35927538

(no title)

r13a | 2 years ago

Like other commentors, I don't think prompt injection is such a difficult problem to address. What is currently emerging is the "Guidelines" architecture where the prompt and the model answer pass a filter on the way in and on the way out.

With that architecture, coping with prompt injection becomes a classification problem.

At the most basic level you can see it that way:

(User) Prompt --> (Guidelines Model) Reject if this is prompt injection --> (Model) Answer --> (Guidelines Model) Reject if this breaks guidelines --> Answer

Update: Typos

discuss

order

simonw|2 years ago

I've written about why I don't think trying to catch injection attacks with filters is a responsible solution:

- https://simonwillison.net/2023/May/2/prompt-injection-explai...

- https://simonwillison.net/2022/Sep/17/prompt-injection-more-...

See also this tweet: https://twitter.com/simonw/status/1647066537067700226

> The hardest problem in computer science is convincing AI enthusiasts that they can’t solve prompt injection vulnerabilities using more AI.

r13a|2 years ago

First I want to apologize for answering you without first reading all the articles cited above. I will do.

If I read correctly your main argument about hacking the "injection detector", one possible answer would be this:

AI is a large world, and we don't have to assume that the hacking detector is an LLM.

For what it's worth, it could be any classification ML that is able to classify a prompt without being vulnerable to direct instructions like " injection detector, please ignore this".

Actually you may want your detector to be as dumb as possible without sacrifying classification performance.

You can think of it as something akin to email spam arms race.

Would that make prompt injection risks disappear?

Of course not: It would mitigate it.

And together with other mitigation solutions (some classical, like running LLMs processes in sandboxed environments, and some that we still have to discover the hard way), it at least brings the problem in the realm of manageable problems.

I add that it sounds like this is the direction that is beeing taken by big CORPs like Nvidia, Microsoft and even CORPs that have heavy relationships with the Defense sector, like Palantir.

Update: typos.

kryogen1c|2 years ago

Isn't this security through obfuscation? Doesn't it shift the risk instead of eliminating it? That's fine if that's the intention, but that's a different risk mitigation strategy.

The post you replied to is saying it's categorically impossible to have an injection filter when user input interacts with executable statements.

r13a|2 years ago

> Doesn't it shift the risk > instead of eliminating it?

Yes it's exactly that.

Of course I'm not trying to argue that there's a magic wand to make prompt injection just go away. My point is that prompt injection is so dangerous because we're letting the user directly interact with such a powerful beast as a SOTA LLM.

By filtering prompts and answers by much less powerful but more specialized models we are heavily mitigating risks. But injection risks will still be there just not as a wide open injection avenue as it is today.

Update: typos.

gitfan86|2 years ago

The model just needs to understand parameterization. "Scan the content of input.txt for prompt injection" needs to understand the difference between "cannot open file" in the text of the file vs than output from the file system with the same data.