(no title)
natrys | 5 months ago
I did a "s/Falun Gong/Hamas/" in your prompt and got the same refusal in GPT-5, GPT-OSS-120B, Claude Sonnet 4, and Gemini-2.5-Pro, as well as in DeepSeek V3.1. And that's completely within my expectation, and probably everyone else's too, considering no one is writing that article.
Goes without saying I am not drawing any parallel between the aforementioned entities, beyond the fact that they are illegal in the jurisdictions where the model creators operate - which, as an explanation for refusal, is fairly straightforward. So we might first need to talk about why that explanation is adequate for everyone else but not for a company operating in China.
godelski | 5 months ago
But I don't think we should talk about explanations until we can do some verification. At this point I'm not entirely sure. The security question is still open, and I'm asking for help because I'm not a security person. Shouldn't we start there?
natrys | 5 months ago
https://i.postimg.cc/6tT3m5mL/screen.png
Note I am using direct API to avoid triggering separate guardrail models typically operating in front of website front-ends.
As an aside the website you used in your original comment:
> [2] Used this link https://www.deepseekv3.net/en/chat
This is not the official DeepSeek website. It is probably one of the many shady third-party sites riding on the DeepSeek name for SEO; who knows what they are running. In this case it doesn't matter, because I already reproduced your prompt with a US-based inference provider directly hosting the DeepSeek weights, but it's still worth noting for methodology.
(Also, to a sceptic, screenshots shouldn't be enough, since they are easily doctored nowadays, but I don't believe these refusals should be surprising in the least to anyone with passing familiarity with these LLMs.)
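For anyone who wants to reproduce this rather than trust a screenshot: hitting the weights through a bare OpenAI-compatible chat-completions endpoint is straightforward. A minimal sketch below, using only the Python standard library; the base URL and model identifier are placeholders (your provider's will differ), and no system prompt is sent, so only the model's own training determines whether it refuses.

```python
# Sketch: query DeepSeek weights via a direct OpenAI-compatible API,
# bypassing any guardrail models sitting in front of a website front-end.
# API_BASE and MODEL are hypothetical placeholders, not real endpoints.
import json
import urllib.request

API_BASE = "https://api.example-provider.com/v1"  # substitute your provider
MODEL = "deepseek-v3.1"                           # assumed model identifier

def build_request(prompt: str, api_key: str) -> urllib.request.Request:
    """Build a raw chat-completions request with no system prompt,
    so refusals come from the model itself, not a wrapper."""
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0,  # keep output as reproducible as possible
    }
    return urllib.request.Request(
        f"{API_BASE}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

# To actually send it (needs a real endpoint and key):
# with urllib.request.urlopen(build_request("your prompt", "sk-...")) as r:
#     print(json.load(r)["choices"][0]["message"]["content"])
```

The point of going this low-level is methodological: nothing between you and the weights except the provider's serving stack, so a refusal can't be blamed on a separate front-end filter.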
---
Obviously sabotage is a whole other can of worms compared to mere refusal, and it's something the article glossed over without showing its prompts. So, without much to go on, it's hard for me to take this seriously. We know garbage in context can degrade performance; even simple typos can[1]. Besides, LLMs at their present level of capability are barely intelligent enough to soundly carry out any serious task, so it strains credulity that they could actually sabotage one with any reasonable degree of sophistication. That said, I look forward to more serious research on this matter.
[1] https://arxiv.org/abs/2411.05345v1
unknown | 5 months ago
[deleted]