I'm a bit worried about this kind of stuff being passed off as "AI safety".
No, making your LLM actively more deceitful and less aligned with user intent is not the way to make AI safe.
It would be very interesting to know how ChatGPT's censorship engine is implemented, though. Are they retraining the whole thing all the time to fix new jailbreaks?
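
Probably not full retraining for each jailbreak. OpenAI hasn't published ChatGPT's internal pipeline, but a common pattern (and one their public Moderation API makes easy) is to run a separate, cheap classifier over the input and output rather than touch the base model at all; weight updates (RLHF, fine-tuning) then happen on a slower cadence. Here's a minimal sketch of that layered-filter pattern using OpenAI's public Moderation API. To be clear, this is just an illustration of the general approach, not ChatGPT's actual code, and the `moderated_chat` helper and model choices are my own for the example:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def moderated_chat(user_message: str) -> str:
    # 1. Classify the incoming message with a separate moderation model.
    mod = client.moderations.create(
        model="omni-moderation-latest",
        input=user_message,
    )
    if mod.results[0].flagged:
        return "Sorry, I can't help with that."

    # 2. Only unflagged input ever reaches the chat model.
    reply = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": user_message}],
    )
    text = reply.choices[0].message.content

    # 3. Optionally re-check the model's own output before returning it.
    if client.moderations.create(input=text).results[0].flagged:
        return "Sorry, I can't share that response."
    return text
```

The appeal of this design is that fixing a new jailbreak only means updating the small classifier (or even just a blocklist/system prompt), which is vastly cheaper and faster than retraining the LLM itself.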