top | item 29753100

mr_luc | 4 years ago

Heh, funny, I'm implementing this exact thing at the moment. Or rather, I'm implementing a security check that provides the same guarantee you mention: mixed-script protection.

In Unicode spec terms, 'UTS #39 (Unicode Security Mechanisms)' describes how to do this, mostly in section 5, and it relies on 'UAX #24 (Unicode Script Property)'.

It's more nuanced than your example, but only slightly. If you replace "German" with "Japanese", you're talking about multiple scripts within the same 'writing system'. To handle that, the spec provides data files listing the set of scripts each character belongs to.

The way the spec tells us to ensure that the word 'microsoft' isn't made up of fishy characters is to keep a running intersection of each character's augmented script set. If that intersection is empty at the end, that's often fishy, i.e., the word mixes scripts that have nothing in common, as with '{Latin}' and '{Cyrillic}'.

However, the spec allows for legitimate writing systems that use more than one script. The lookup procedure outlined in the spec could give script sets like '{Jpan, Kore, Hani, Hanb}' and '{Jpan, Kana}' for two characters, and that intersection isn't empty: it's '{Jpan}', which gives us the answer "Okay, this word is contained within the Japanese writing system".
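
For the curious, here's a rough Python sketch of the intersection idea. It's a toy under stated assumptions, not my actual implementation: `TOY_SCRIPTS` is a hand-written stand-in for the script data files (a real check would load them), the function names are made up, and the augmentation rules are my reading of the spec.

```python
# Sentinel meaning "every script"; used for Common/Inherited characters,
# which should not narrow the running intersection.
ALL = None

# Hypothetical, hand-written stand-in for the per-character script data.
# A real implementation would load this from the Unicode data files.
TOY_SCRIPTS = {
    "a": {"Latn"},
    "o": {"Latn"},
    "\u043e": {"Cyrl"},  # Cyrillic small letter o
    "\u65e5": {"Hani"},  # Han ideograph
    "\u672c": {"Hani"},  # Han ideograph
    "\u306e": {"Hira"},  # Hiragana
    "\u30ab": {"Kana"},  # Katakana
    "-": {"Zyyy"},       # Common
}


def augmented_script_set(ch):
    """Return the script set for ch, widened with the multi-script
    writing systems (Jpan, Kore, Hanb) it can take part in."""
    scripts = set(TOY_SCRIPTS[ch])
    if scripts & {"Zyyy", "Zinh"}:
        return ALL
    if "Hani" in scripts:
        scripts |= {"Hanb", "Jpan", "Kore"}
    if scripts & {"Hira", "Kana"}:
        scripts.add("Jpan")
    if "Hang" in scripts:
        scripts.add("Kore")
    if "Bopo" in scripts:
        scripts.add("Hanb")
    return scripts


def resolved_script_set(text):
    """Intersect the augmented script sets of every character in text."""
    result = ALL
    for ch in text:
        scripts = augmented_script_set(ch)
        if scripts is ALL:
            continue  # Common/Inherited: matches everything
        result = scripts if result is ALL else result & scripts
    return result


def is_single_script(text):
    """True when some script (or writing system) covers the whole text."""
    resolved = resolved_script_set(text)
    return resolved is ALL or len(resolved) > 0
```

With this toy table, `is_single_script("a\u043e")` (Latin 'a' plus Cyrillic 'о') comes out False, while a string mixing Han, Hiragana, and Katakana resolves to `{"Jpan"}` and passes.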
