top | item 12611071

(no title)

niccaluim | 9 years ago

FWIW the Unicode spec describes combining marks as characters in their own right. So if the intent is to reverse characters, page 21 does the job. The resulting sequences will potentially be defective but not ill-formed.

That being said, an FAQ on combining characters points out that Unicode's definition of "character" may not match an end user's, and that it's best to use the word "grapheme" instead for clarity. (And that being said, if the typical end user knows what "grapheme" means, I'll eat my cat.)

So from a practical standpoint, it's best to make sure that any input to rev is in one of the composed normal forms.

(Incidentally, the proper sequence is <base character><combining character>…, not the other way around.)
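To make the normalization point concrete, here is a minimal Python sketch. It assumes rev reverses code point by code point; `naive_reverse` is a hypothetical stand-in for that behaviour, not the actual implementation of any tool.

```python
import unicodedata

def naive_reverse(s: str) -> str:
    # Hypothetical stand-in for a code-point-based rev: reverses code points only.
    return s[::-1]

# "café" written as 'e' followed by U+0301 COMBINING ACUTE ACCENT (decomposed).
decomposed = "cafe\u0301"

# NFC composes the pair into the single code point U+00E9.
composed = unicodedata.normalize("NFC", decomposed)

# Reversing the decomposed form moves the accent to the front, where it has
# no base character: a defective (but not ill-formed) sequence.
broken = naive_reverse(decomposed)

# Reversing the composed form keeps the accent attached to the 'e'.
fixed = naive_reverse(composed)
```

This only helps when a precomposed code point exists for the sequence in question.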


bhaak|9 years ago

> So from a practical standpoint, it's best to make sure that any input to rev is in one of the composed normal forms.

But there are real-world characters that don't have precomposed forms (IIRC, e.g., in Indic scripts).
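A rough Python sketch of what handling that case could look like: instead of relying on normalization, group each mark with the base that precedes it and reverse the groups. `reverse_clusters` is a hypothetical helper, and treating "any code point whose general category starts with M" as part of the previous cluster is an assumption that only approximates real grapheme segmentation (it is not the full Unicode text segmentation algorithm).

```python
import unicodedata

def reverse_clusters(s: str) -> str:
    # Hypothetical helper: attach every mark (general category Mn, Mc, Me)
    # to the preceding code point, then reverse the resulting groups.
    clusters = []
    for ch in s:
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return "".join(reversed(clusters))

# Devanagari KA (U+0915) + VOWEL SIGN I (U+093F): no precomposed code point,
# so NFC leaves it as two code points.
ki = unicodedata.normalize("NFC", "\u0915\u093f")

# The vowel sign stays attached to KA after reversal.
reversed_text = reverse_clusters("a" + ki + "b")
```

A plain code point reversal of the same string would put the vowel sign ahead of its consonant.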

kps|9 years ago

> Incidentally, the proper sequence is <base character><combining character>…, not the other way around.

A mistake in Unicode, IMHO. The other way around, it would have been possible to identify the end of a combining sequence without looking past the sequence. Also, ‘dead keys’ could have directly generated the required combining characters just like normal characters, rather than requiring special processing.
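A small Python sketch of the lookahead argument, under the simplifying assumption that a "combining sequence" is one non-mark code point plus adjacent marks. `clusters_base_first` follows the actual Unicode order; `clusters_mark_first` models the hypothetical marks-before-base order, which is not how Unicode works.

```python
import unicodedata
from typing import Iterable, Iterator

def is_mark(ch: str) -> bool:
    # Simplifying assumption: any general category starting with "M" is a combining mark.
    return unicodedata.category(ch).startswith("M")

def clusters_base_first(stream: Iterable[str]) -> Iterator[str]:
    # Actual Unicode order: base, then marks. A sequence can only be emitted
    # once the NEXT non-mark has been read (or the input has ended).
    current = ""
    for ch in stream:
        if current and not is_mark(ch):
            yield current
            current = ""
        current += ch
    if current:
        yield current

def clusters_mark_first(stream: Iterable[str]) -> Iterator[str]:
    # Hypothetical order: marks, then base. The sequence is complete as soon
    # as the base is read, with no need to look at what follows.
    current = ""
    for ch in stream:
        current += ch
        if not is_mark(ch):
            yield current
            current = ""
    if current:
        yield current  # trailing marks with no base
```

For example, `list(clusters_base_first("e\u0301a"))` only yields the first sequence after reading `a`, whereas `clusters_mark_first("\u0301ea")` yields it immediately on reading `e`.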