I made a small project (https://github.com/metawake/puppetry-detector) to detect this type of LLM policy manipulation.
It's an early idea using a set of regexp patterns (for speed) and a couple of phases of text analysis.
I am curious if it's any useful, I created integration with Rebuff (loss security suite) just in case.
No comments yet.