(no title)
alex_suzuki | 1 month ago
I’m just wondering whether, from a technical perspective, it’s even possible to do this in a way that would 100% solve the problem rather than turning it into an arms race of jailbreak-finding. That would mean truly removing the capability from the model, or, failing that, having a perfect oracle judge every output and block it.
The answer is currently no, I presume.
ebbi | 1 month ago
For argument's sake, let's assume Grok can't reliably have guardrails in place to stop CSAM. There could be second- and third-order review points: before an image is posted by Grok, another system could scan it to verify whether it's CSAM, and if the classifier's confidence is low, human intervention could come into play.
I think the end goal here is prevention of CSAM production and dissemination, not just putting guardrails in an LLM and calling it a day.
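A rough sketch of what that second-pass gate might look like, assuming a separate image classifier that returns a violation probability; the function names, thresholds, and routing below are purely illustrative, not any real X/Grok API:

```python
# Illustrative sketch only: csam_score() is a stand-in for an external
# image classifier, and the thresholds are made-up numbers.
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    BLOCK = "block"          # high-confidence violation: never posted
    HUMAN_REVIEW = "review"  # uncertain: held for a human moderator
    ALLOW = "allow"          # high-confidence clean


@dataclass
class ModerationResult:
    decision: Decision
    score: float  # classifier's estimated probability of a violation


def csam_score(image_bytes: bytes) -> float:
    """Placeholder for a dedicated scanner run as a separate service
    from the image generator (hash-matching, ML classifier, etc.)."""
    raise NotImplementedError


def moderate_before_posting(image_bytes: bytes,
                            block_threshold: float = 0.9,
                            review_threshold: float = 0.3) -> ModerationResult:
    """Second-pass check on every generated image before it is posted."""
    score = csam_score(image_bytes)
    if score >= block_threshold:
        return ModerationResult(Decision.BLOCK, score)
    if score >= review_threshold:
        return ModerationResult(Decision.HUMAN_REVIEW, score)
    return ModerationResult(Decision.ALLOW, score)
```

The point of the sketch is just that the gate sits outside the LLM: even if the generator is jailbroken, nothing reaches the public feed without passing (or being held by) the independent check.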