ubernostrum | 7 years ago

From the perspective of Unicode, no. What you're looking for here is what Unicode calls "equivalence", and it comes in two variations: canonical equivalence and compatibility equivalence.

For example, "é" can be written as either U+00E9 LATIN SMALL LETTER E WITH ACUTE, or as the sequence U+0065 LATIN SMALL LETTER E followed by U+0301 COMBINING ACUTE ACCENT. These two options have canonical equivalence; what this means is that Unicode treats them as two ways of specifying exactly the same thing.
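A quick way to see this, sketched with Python's standard `unicodedata` module (the variable names are only for illustration):

```python
import unicodedata

composed = "\u00e9"      # U+00E9 LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # U+0065 followed by U+0301 COMBINING ACUTE ACCENT

# As raw code point sequences the two strings differ...
raw_equal = composed == decomposed

# ...but normalizing both to the same canonical form makes them identical.
nfc_equal = unicodedata.normalize("NFC", composed) == unicodedata.normalize("NFC", decomposed)
nfd_equal = unicodedata.normalize("NFD", composed) == unicodedata.normalize("NFD", decomposed)

print(raw_equal, nfc_equal, nfd_equal)  # False True True
```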

Now, consider "½". That's U+00BD VULGAR FRACTION ONE HALF. Its compatibility decomposition is the sequence U+0031 DIGIT ONE, U+2044 FRACTION SLASH, U+0032 DIGIT TWO ("1⁄2"), which is the normalized counterpart of typing "1/2". This is not quite the same thing; most places where someone writes "½" can safely use "1⁄2" instead, but not necessarily all, and it definitely doesn't work in reverse. This is compatibility equivalence, and under compatibility equivalence "½" maps to "1⁄2".
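The difference from the canonical case can be checked the same way. This is a small sketch using Python's `unicodedata`; note that the expansion it reports uses U+2044 FRACTION SLASH as the separator, and that normalizing the expanded form never brings "½" back:

```python
import unicodedata

half = "\u00bd"  # U+00BD VULGAR FRACTION ONE HALF

canonical = unicodedata.normalize("NFC", half)   # canonical forms leave it alone
compat = unicodedata.normalize("NFKC", half)     # compatibility forms expand it

# Going the other way: the expanded sequence does not normalize back to U+00BD.
round_trip = unicodedata.normalize("NFKC", compat)

print([f"U+{ord(c):04X}" for c in canonical])  # ['U+00BD']
print([f"U+{ord(c):04X}" for c in compat])     # ['U+0031', 'U+2044', 'U+0032']
```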

So to get to your actual question: U+017F LATIN SMALL LETTER LONG S has compatibility equivalence with U+0073 LATIN SMALL LETTER S. But U+03C2 GREEK SMALL LETTER FINAL SIGMA does not have any type of equivalence with U+03C3 GREEK SMALL LETTER SIGMA.

If you follow the general recommendations for things like comparing Unicode identifiers, you'll apply normalization to form NFKC (which decomposes by compatibility equivalence, then recomposes by canonical equivalence); this will turn a "ſ" into an "s". It will never turn a "ς" into a "σ".
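Both claims are easy to check. A minimal sketch, again assuming Python's `unicodedata` as the normalization implementation:

```python
import unicodedata

long_s = "\u017f"        # U+017F LATIN SMALL LETTER LONG S
final_sigma = "\u03c2"   # U+03C2 GREEK SMALL LETTER FINAL SIGMA
sigma = "\u03c3"         # U+03C3 GREEK SMALL LETTER SIGMA

long_s_nfkc = unicodedata.normalize("NFKC", long_s)
final_sigma_nfkc = unicodedata.normalize("NFKC", final_sigma)

print(long_s_nfkc == "s")          # True: long s folds to a plain "s"
print(final_sigma_nfkc == sigma)   # False: final sigma is left untouched

# The raw decomposition data shows why: only the long s has a mapping.
print(repr(unicodedata.decomposition(long_s)))       # '<compat> 0073'
print(repr(unicodedata.decomposition(final_sigma)))  # ''
```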

cryptonector|7 years ago

If you're just comparing strings, do the comparison one character at a time: decompose the next character of each string (no need to recompose, and since you only handle one character at a time, look ma', no allocation needed), compare the decomposed code points, then either fail or move on to the next character. I call this form-insensitive string comparison.
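A rough sketch of that idea in Python, to show the control flow only. The names `decomposed_code_points` and `form_insensitive_equal` are hypothetical, canonical decomposition (NFD) is assumed, and Python will of course allocate small strings internally, so this says nothing about the allocation-free claim:

```python
import unicodedata
from itertools import zip_longest


def decomposed_code_points(text):
    """Yield the canonical decomposition of each character, one character at a time."""
    for ch in text:
        # normalize() on a single character returns its full canonical decomposition.
        yield from unicodedata.normalize("NFD", ch)


def form_insensitive_equal(a, b):
    """Hypothetical helper: compare two strings by their per-character decompositions."""
    # zip_longest pads the shorter stream with None, so a length mismatch fails.
    pairs = zip_longest(decomposed_code_points(a), decomposed_code_points(b))
    # all() stops at the first mismatch, so nothing past it is decomposed.
    return all(x == y for x, y in pairs)


print(form_insensitive_equal("caf\u00e9", "cafe\u0301"))  # True
print(form_insensitive_equal("cafe", "caf\u00e9"))        # False
```

One caveat of this simplified version: it does not reorder combining marks across character boundaries, so "q\u0307\u0323" and "q\u0323\u0307" compare unequal here even though full NFD normalization makes them identical.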

ubernostrum|7 years ago

Inventing your own pseudo-normalization of Unicode is a worse idea than using the actual normalization forms Unicode defines.

Also, if you think you can decompose without allocating memory... well, try a code point like U+FDFA.

For reference, its decomposition is:

U+0635 U+0644 U+0649 U+0020 U+0627 U+0644 U+0644 U+0647 U+0020 U+0639 U+0644 U+064A U+0647 U+0020 U+0648 U+0633 U+0644 U+0645
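The expansion can be reproduced with Python's `unicodedata`. Note that this is a compatibility decomposition, so it appears under NFKD; under canonical NFD the code point is left as it is:

```python
import unicodedata

ligature = "\ufdfa"  # U+FDFA ARABIC LIGATURE SALLALLAHOU ALAYHE WASALLAM

nfd = unicodedata.normalize("NFD", ligature)    # canonical decomposition
nfkd = unicodedata.normalize("NFKD", ligature)  # compatibility decomposition
nfkd_code_points = [f"U+{ord(c):04X}" for c in nfkd]

print(len(nfd))    # 1
print(len(nfkd))   # 18
print(" ".join(nfkd_code_points))
```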

(and that doesn't begin to touch any of the potential issues with variant forms, homoglyph attacks, etc.)