
alex_suzuki | 1 month ago

Well it’s certainly horrible that they’re not even trying, but not surprising (I deleted my X account a long time ago).

I’m just wondering whether, from a technical perspective, it’s even possible to solve this 100% rather than turning it into an arms race of jailbreak-hunting: either truly remove the capability from the model or, failing that, have a perfect oracle judge every output and block the bad ones.

The answer is currently no, I presume.

ebbi | 1 month ago

Again, I'm not the most technical, but I think we need to step back and look at this holistically. Given Grok's integration with X, there could be other methods of limiting the production and dissemination of CSAM.

For argument's sake, let's assume Grok can't reliably have guardrails in place to stop CSAM. There could still be second- and third-order review points: before an image generated by Grok is posted, a separate system could scan it to verify whether it's CSAM, and if the classifier's confidence is low, human review could come into play before anything goes live.
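
Not claiming this is how X would build it, but as a minimal sketch of that kind of pre-publish gate: the classify_image call, the ScanResult type, and the thresholds below are all hypothetical stand-ins, not a real detection API.

    from dataclasses import dataclass
    from enum import Enum, auto

    class Verdict(Enum):
        BLOCK = auto()          # high-confidence match: never post, report per policy/law
        HUMAN_REVIEW = auto()   # uncertain: hold the post until a reviewer decides
        ALLOW = auto()          # high-confidence clean: safe to publish

    @dataclass
    class ScanResult:
        score: float  # hypothetical classifier output: P(image is CSAM), in [0, 1]

    def classify_image(image_bytes: bytes) -> ScanResult:
        # Placeholder for a real detection system (e.g. known-hash matching
        # plus an ML classifier). Stubbed out here; not a real API.
        raise NotImplementedError

    def gate(image_bytes: bytes,
             block_above: float = 0.9,
             allow_below: float = 0.1) -> Verdict:
        """Second-order check that runs *before* a generated image is posted."""
        score = classify_image(image_bytes).score
        if score >= block_above:
            return Verdict.BLOCK
        if score <= allow_below:
            return Verdict.ALLOW
        # The gray zone between the two thresholds is the "low confidence"
        # case: a human makes the call, not the model.
        return Verdict.HUMAN_REVIEW

The point of having two thresholds is that the pipeline fails closed: anything the scanner isn't sure about gets held for a person instead of being posted.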

I think the end goal here is preventing the production and dissemination of CSAM, not just putting guardrails in an LLM and calling it a day.