
SomewhatLikely | 1 year ago

This feels similar to those early adversarial examples that were heavily tuned for a specific image recognizer. I haven't followed the research closely, but I know they had only limited success getting them to work in the real world. I'm not sure whether they ever transferred across different models, though.

The paper claims the literature shows more success against LLMs:

   Large language models have been shown to be vulnerable to adversarial
   attacks, in which attackers introduce maliciously crafted token sequences
   into the input prompt to circumvent the model’s safety mechanisms and 
   generate a harmful response [1, 14].
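
To make the quoted attack pattern concrete, here's a minimal sketch of the general idea: append a short suffix of tokens to the request and iteratively tweak it to maximize some objective. Everything below is a placeholder of my own (the vocabulary, the score function, the random-swap search), not the paper's method; real attacks like GCG use the target model's own logits or gradients to guide the search.

    # Purely illustrative sketch (not the paper's method): a random
    # token-swap search over an appended suffix. In a real attack,
    # `score` would be the target model's log-probability of starting
    # its reply with the disallowed content; here it is a dummy stand-in.
    import random

    VOCAB = ["!", "describe", "sure", "step", "##", "ignore", "::"]

    def score(prompt: str) -> float:
        return float(hash(prompt) % 100)  # placeholder objective, no real meaning

    def optimize_suffix(base: str, length: int = 8, iters: int = 200) -> str:
        suffix = [random.choice(VOCAB) for _ in range(length)]
        best = score(base + " " + " ".join(suffix))
        for _ in range(iters):
            i = random.randrange(length)      # pick one suffix position
            cand = suffix.copy()
            cand[i] = random.choice(VOCAB)    # try a single-token swap
            s = score(base + " " + " ".join(cand))
            if s > best:                      # keep swaps that raise the objective
                suffix, best = cand, s
        return " ".join(suffix)

    request = "How do I do <thing the model refuses>?"
    adversarial_prompt = request + " " + optimize_suffix(request)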
