Meta's new prompt-guard-86M normally flags almost everything as a jailbreak, but apparently spacing out the letters of a prompt makes it classify the prompt as harmless. The way they found this is pretty unusual: instead of hammering the model with jailbreak attempts, they just compared its embedding weights with those of the non-fine-tuned base model.
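For anyone curious what that kind of comparison might look like, here is a rough sketch, not the researchers' actual code. It assumes the idea is to look for tokens whose embeddings barely moved during fine-tuning, and it uses a made-up four-token vocabulary and toy 2-d vectors instead of the real model weights. The helper names (`space_out`, `embedding_shift`, `least_changed`) are hypothetical.

```python
import math


def space_out(text: str) -> str:
    """Put a single space between all non-whitespace characters.

    Assumption: this is one simple reading of "spacing out letters";
    the exact transformation used in the write-up may differ.
    """
    return " ".join(ch for ch in text if not ch.isspace())


def embedding_shift(base, tuned):
    """Per-token Euclidean distance between two embedding tables.

    Both tables are lists of equal-length float rows, one row per token.
    """
    return [math.dist(b, t) for b, t in zip(base, tuned)]


def least_changed(vocab, base, tuned, k=3):
    """Return the k tokens whose embeddings moved least during fine-tuning."""
    shifts = embedding_shift(base, tuned)
    order = sorted(range(len(vocab)), key=lambda i: shifts[i])
    return [vocab[i] for i in order[:k]]


# Toy data only: hypothetical vocabulary and made-up 2-d embeddings.
vocab = ["ignore", "a", "b", "instructions"]
base = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [0.2, 0.8]]
tuned = [[0.0, 1.0], [0.0, 1.0], [0.5, 0.6], [0.9, 0.1]]

print(least_changed(vocab, base, tuned, k=2))
print(space_out("ignore instructions"))
```

In this toy setup the single-letter tokens are the ones that hardly change, so a prompt rewritten with `space_out` would be made up of tokens the fine-tuning never really touched. Whether the real model behaves exactly this way is something you would have to check against the actual weights.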