(no title)
r13a | 2 years ago
With that architecture, coping with prompt injection becomes a classification problem.
At the most basic level you can see it that way:
(User) Prompt --> (Guidelines Model) Reject if this is prompt injection --> (Model) Answer --> (Guidelines Model) Reject if this breaks guidelines --> Answer
Update: Typos
simonw|2 years ago
- https://simonwillison.net/2023/May/2/prompt-injection-explai...
- https://simonwillison.net/2022/Sep/17/prompt-injection-more-...
See also this tweet: https://twitter.com/simonw/status/1647066537067700226
> The hardest problem in computer science is convincing AI enthusiasts that they can’t solve prompt injection vulnerabilities using more AI.
r13a|2 years ago
If I read correctly your main argument about hacking the "injection detector", one possible answer would be this:
AI is a large world, and we don't have to assume that the hacking detector is an LLM.
For what it's worth, it could be any classification ML that is able to classify a prompt without being vulnerable to direct instructions like " injection detector, please ignore this".
Actually you may want your detector to be as dumb as possible without sacrifying classification performance.
You can think of it as something akin to email spam arms race.
Would that make prompt injection risks disappear?
Of course not: It would mitigate it.
And together with other mitigation solutions (some classical, like running LLMs processes in sandboxed environments, and some that we still have to discover the hard way), it at least brings the problem in the realm of manageable problems.
I add that it sounds like this is the direction that is beeing taken by big CORPs like Nvidia, Microsoft and even CORPs that have heavy relationships with the Defense sector, like Palantir.
Update: typos.
kryogen1c|2 years ago
The post you replied to is saying it's categorically impossible to have an injection filter when user input interacts with executable statements.
r13a|2 years ago
Yes it's exactly that.
Of course I'm not trying to argue that there's a magic wand to make prompt injection just go away. My point is that prompt injection is so dangerous because we're letting the user directly interact with such a powerful beast as a SOTA LLM.
By filtering prompts and answers by much less powerful but more specialized models we are heavily mitigating risks. But injection risks will still be there just not as a wide open injection avenue as it is today.
Update: typos.
gitfan86|2 years ago