top | item 34960871

(no title)

thewopr | 3 years ago

This is interesting. Pondering about this, the vulnerability seems rooted in the very nature of the LLMs and how they work. They conflate instruct and data in a messy way.

My first thought here was to somehow separate instruct and data in how the models are trained. But in many ways, there is no (??) way to do that in the current model construct. If I say "Write a poem about walking through the forest", everything, including the data part of the prompt "walking through the forest" is instruct.

So you couldn't create a safe model which only takes instruct from the model owner, and can otherwise take in arbitrary information from untrusted sources.

Ultimately, this may push AI applications towards information and retrieval-focused task, and not any sort of meaningful action.

For example, I can't create a AI bot that could send a customer monetary refunds as it could be gamed in any number of ways. But I can create an AI bot to answer questions about products and store policy.

discuss

gwern|3 years ago

Which of course reflects how language and real-world text data is! There is no such separation. It is, in fact, profoundly difficult to separate 'instruction' and 'data', and every single injection attack (as well as all the related classes of attacks) exploits this fact. It's not some weird little language model glitch, it's a profound fact that we have spent generations engineering layer after layer of software trying to hide from ourselves. So, it may be quite difficult to resolve in full generality. (As opposed to Bing's attitude which is the old 1990s MS attitude of just patch the instances that anyone complains about.)

quanticle|3 years ago

>But I can create an AI bot to answer questions about products and store policy.

Why wouldn't someone be able to game your bot's responses about refunds and store policy in exactly the same way? Then, when the customer really does come in with a return or refund request, you're forced into a dilemma where either you grant the refund (and accept that your store policy isn't the written policy, but rather whatever your bot can be manipulated into saying is your written policy) or you refuse the refund, and the customer walks away angry, because your own bot told them something that you're now contradicting.