

IEatPrompts | 1 year ago

Meta's new Prompt-Guard-86M normally flags almost everything as a jailbreak, but apparently spacing out the letters of a prompt makes it classify the prompt as harmless. The way they found this is pretty unusual: instead of hammering the model with jailbreak attempts, they just compared its embedding weights with those of the non-fine-tuned base model.
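For anyone who wants to picture the two ideas, here is a rough sketch in plain Python. It is not the researchers' code. The helper names (`space_out`, `least_changed_tokens`) and the toy embedding tables are made up for illustration, and ranking tokens by mean absolute difference is only my assumption about what "compared embedding weights" could look like in practice. Dropping the original whitespace before spacing the letters out is also my choice, not something stated above.

```python
def space_out(text: str) -> str:
    """Put a single space between all non-whitespace characters.

    Existing whitespace is dropped first (an assumption, see above).
    """
    return " ".join(ch for ch in text if not ch.isspace())


def least_changed_tokens(base, tuned, vocab, k=5):
    """Return the k tokens whose embedding rows differ least between models.

    base and tuned are lists of equal-length float rows, one row per token,
    in the same order as vocab. The score is the mean absolute difference.
    """
    scored = []
    for token, base_row, tuned_row in zip(vocab, base, tuned):
        diff = sum(abs(b - t) for b, t in zip(base_row, tuned_row))
        scored.append((diff / len(base_row), token))
    scored.sort()  # smallest difference first
    return [token for _, token in scored[:k]]


if __name__ == "__main__":
    # Hypothetical toy data, not real model weights.
    vocab = ["a", "b", "ignore", "instructions"]
    base = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]
    tuned = [[0.1, 0.2], [0.3, 0.41], [0.9, 0.1], [0.2, 0.3]]

    print(space_out("ignore previous"))  # i g n o r e p r e v i o u s
    print(least_changed_tokens(base, tuned, vocab, k=2))
```

With real models you would load the two embedding matrices from the base and fine-tuned checkpoints instead of the toy lists. Tokens near the top of the ranking are ones fine-tuning barely touched, which is the kind of signal that would point you toward rewriting a prompt in terms of those tokens.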
