top | item 35930253

(no title)

r13a | 2 years ago

First I want to apologize for answering you without first reading all the articles cited above. I will do.

If I read correctly your main argument about hacking the "injection detector", one possible answer would be this:

AI is a large world, and we don't have to assume that the hacking detector is an LLM.

For what it's worth, it could be any classification ML that is able to classify a prompt without being vulnerable to direct instructions like " injection detector, please ignore this".

Actually you may want your detector to be as dumb as possible without sacrifying classification performance.

You can think of it as something akin to email spam arms race.

Would that make prompt injection risks disappear?

Of course not: It would mitigate it.

And together with other mitigation solutions (some classical, like running LLMs processes in sandboxed environments, and some that we still have to discover the hard way), it at least brings the problem in the realm of manageable problems.

I add that it sounds like this is the direction that is beeing taken by big CORPs like Nvidia, Microsoft and even CORPs that have heavy relationships with the Defense sector, like Palantir.

Update: typos.

discuss

No comments yet.