top | item 46049454

(no title)

> former Dean of Electronics Engineering and Computer Science at Peking University, has noted that Chinese data makes up only 1.3 percent of global large-model datasets (The Paper, March 24). Reflecting these concerns, the Ministry of State Security (MSS) has issued a stark warning that “poisoned data” (数据投毒) could “mislead public opinion” (误导社会舆论) (Sina Finance, August 5).

from a technical point of view, I suppose it's actually not a problem like he suggests. You can use all the pro-democracy, pro-free-speech, anti-PRC data in the world, but the pretraining stages (on the planet's data) are more for instilling core language abilities, and are far less important than the SFT / RL / DPO / etc stages, which require far less data, and can tune a model towards whatever ideology you'd like. Plus, you can do things like selectively identify vectors that encode for certain high-level concepts, and emphasize them during inference, like Golden Gate Claude.

discuss

XenophileJKO|3 months ago

I was thinking about this yesterday.

My personal opinion is that the PRC will face a self created headwind that likely, structurally, will prevent them from leading in AI.

As the model get's more powerful, you can't simply train the model on your narrative if it doesn't align with real data/world.

At some capacity, the model will notice and then it becomes a can of worms.

This means they need to train the model to be purposefully duplicitous, which I predict will make the model less useful/capable. At least in most of the capacities we would want to use the model.

It also ironically makes the model more of a threat and harder to control. So likely it will face party leadership resistance as capability grows.

I just don't see them winning the race to high intelligence models.

intalentive|3 months ago

>As the model get's more powerful, you can't simply train the model on your narrative if it doesn't align with real data/world.

That’s what “AI alignment” is. Doesn’t seem to be hurting Western models.

vkou|3 months ago

> As the model get's more powerful, you can't simply train the model on your narrative if it doesn't align with real data/world.

What makes you think they have no control over the 'real data/world' that will be fed into training it? What makes you think they can't exercise the necessary control over the gatekeeper firms, to train and bias the models appropriately?

And besides, if truth and lack of double-think was a pre-requisite for AI training, we wouldn't be training AI. Our written materials have no shortage of bullshit and biases that reflect our culture's prevailing zeitgheist. (Which does not necessarily overlap with objective reality... And neither does the subsequent 'alignment' pass that everyone's twisting their knickers in trying to get right.)

StopDisinfo910|3 months ago

Do they really need the model to be duplicious?

It's not like the CCP holds power though tight control of information, notice the tremendous amount of Chinese students who enroll every year before going back.

At the moment, they mostly censor their models post-answer generation and that seems to work fine enough for them.

ferguess_k|3 months ago

I think PRC officials are fine to lagging behind in the frontiers of AI. What they want is very fast deployment and good application. They don't fancy the next Nobel's prize but want a thousand use cases deployed.

boznz|3 months ago

Just as an aside; Why is "intelligence" always considered to be more data? Giving a normal human a smartphone does not make them as intelligent as Newton or Einstein, any entity with sufficient grounding in logic and theory that a normal schoolkid gets should be able to get to AGI, looking up any new data they need as required.

esafak|3 months ago

Would you say they face the same problem biologically, of reaching the state of the art in various endeavors while intellectually muzzling their population? If humans can do it why can't computers?

saubeidl|3 months ago

That is assuming the capitalist narrative preferred by US leadership is non-ideological.

I suspect both are bias factors.

skissane|3 months ago

> As the model get's more powerful, you can't simply train the model on your narrative if it doesn't align with real data/world.

> At some capacity, the model will notice and then it becomes a can of worms.

I think this is conflating “is” and “ought”, fact and value.

People convince themselves that their own value system is somehow directly entailed by raw facts, such that mastery of the facts entail acceptance of their values, and unwillingness to accept those values is an obstacle to the mastery of the facts-but it isn’t true.

Colbert quipped that “Reality has a liberal bias”-but does it really? Or is that just more bankrupt Fukuyama-triumphalism which will insist it is still winning all the way to its irreversible demise?

It isn’t clear that reality has any particular ideological bias-and if it does, it isn’t clear that bias is actually towards contemporary Western progressivism-maybe its bias is towards the authoritarianism of the CCP, Russia, Iran, the Gulf States-all of which continue to defy Western predictions of collapse-or towards their (possibly milder) relatives such as Modi’s India or Singapore or Trumpism. The biggest threat to the CCP’s future is arguably demographics-but that’s not an argument that reality prefers Western progressivism (whose demographics aren’t that great either), that’s an argument that reality prefers the Amish and Kiryas Joel (see Eric Kaufmann’s “Shall the Religious Inherit the Earth?”)

cheesecompiler|3 months ago

You say it like western nations don't operate on double-think, delusions of meritocracy, or power disproportionately concentrating in monopolies.

narrator|3 months ago

The glitchy stuff in the model reasoning is likely to come from the constant redefinition of words that communists and other ideologues like to engage in. For example "People's Democratic Republic of Korea."

zqy123007|3 months ago

There are different techniques and namings. Essentially, EVERY model is biased/aligned towards something, perhaps its creator's value. China or NOT. Look at Grok and read Elon Look at Claude and Dario

I am sure OpenAI and GDM have some secret alignment sets which are not pilled towards the interet of general public, they just smart enough to NOT talking about it out loud...

faxmeyourcode|3 months ago

I think they're referring to this study on LLM poisoning in the pretraining step: https://arxiv.org/abs/2510.07192 (related article: https://www.anthropic.com/research/small-samples-poison)

I'll admit I'm out of my element when discussing this stuff. Maybe somebody more plugged into the research can enlighten.

christina97|3 months ago

The ministry of state security is not issuing warnings due to an arXiv paper… it’s a different type of “poison”.

ksynwa|3 months ago

If you read the source, the concerns around poisoning are more sober than fear of wrongthink. Here is how firefox translated it for me:

> It leads to real-world risks. Data pollution can also pose a range of real-world risks, particularly in the areas of financial markets, public safety and health care.In the financial field, outlaws use AI to fabricate false information, causing data pollution, which may cause abnormal fluctuations in stock prices, and constitute a new type of market manipulation risk; in the field of public safety, data pollution is easy to disturb public perception, mislead public opinion, and induce social panic; in the field of medical and health, data pollution may cause models to generate wrong diagnosis and treatment suggestions, which not only endangers the safety of patients, but also aggravates the spread of pseudoscience.

lenkite|3 months ago

PRC just needs to sponsor a "Voice of China" and pay ¥¥¥/$$$/€€€/₹₹₹ to "journalists" and seed the web with millions of "China is Great" articles. Make sure to have 10k "contributors" on Wikipedia too. (I think they already do this).

Also use the NPM registry - put CCP slogans in the terminal! They will come in billions of ingestible build logs.

Problem will be easily solved.

cma|3 months ago

> and can tune a model towards whatever ideology you'd like.

Maybe possible, but, for example, Musk's recent attempts at getting Grok to always bolster him had Grok bragging Musk could drink the most piss in the world if humanity's fate depended on it and would be the absolute best at eating shit if that was the challenge.