top | item 35266868

(no title)

rnosov | 2 years ago

You describe supervisor approach as:

> One common suggestion is to have another LLM look at the input intently with the instruction to determine whether it is malicious.

Preflight prompt check is actually opposite of that in a sense that it is more like a concurrent injection. You embed a random instruction with a known output and compare completions. As far as I know, nobody has been able to bypass it so far. False positives would be a problem but as you point out microsoft has no issue with collateral damage and blocking all github subdomains wholesale at the moment.

Similarly, you can embed a second instruction during preflight check asking for a count of [system] mentions. Since you know this number beforehand, if it changes it will signal that the prompt is poisoned.

discuss

No comments yet.