(author here) How do you know what's a prompt injection versus actual content? If you train another LLM to flag injections, how do you know it has 100% coverage of all possible injections? OpenAI has been battling people trying to bypass their prompt-rewrite filter and, as far as I can see, not really winning: they just keep adding entries to their blocklist until the next bypass gets discovered.
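To make the coverage problem concrete, here's a minimal sketch (all names and phrases hypothetical, not OpenAI's actual filter) of why a blocklist-style filter is always one paraphrase behind:

```python
# Hypothetical denylist filter: it can only catch phrasings someone
# has already seen and added, so any rewording slips through until
# it too gets added to the list.

BLOCKLIST = [
    "ignore previous instructions",
    "disregard your system prompt",
]

def is_injection(text: str) -> bool:
    """Return True if the text contains a known injection phrase."""
    lowered = text.lower()
    return any(phrase in lowered for phrase in BLOCKLIST)

# A known phrasing is caught:
print(is_injection("Please IGNORE previous instructions"))
# A trivial paraphrase of the same attack is not:
print(is_injection("Kindly set aside the earlier directives"))
```

The space of paraphrases is effectively unbounded, which is the core of the coverage objection: whether the matcher is a string list or another LLM, you're enumerating a set you can't fully enumerate.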