top | item 13877266

(no title)

digler999 | 9 years ago

what makes other encodings hard ? The two things that come to my mind are byte length and comparison function. If the encoding had a fixed-length byte length, then it should be just swapping n-bytes at a time instead of 1-byte. What else is difficult about non-ascii encodings ?

discuss

detaro|9 years ago

e.g. in UTF-8 a codepoint is encoded in varying byte lengths (so you have to split into codepoints and then reverse), and, a lot more difficult, a sequence of multiple codepoints can be combined to form a symbol. Simplest case would be something like "ö" encoded as "o" (U+006F) followed by a combining diaeresis (U+0308).

Other fun special cases: 🇺🇸 is U+1F1FA REGIONAL INDICATOR SYMBOL LETTER U, followed by U+1F1F8 REGIONAL INDICATOR SYMBOL LETTER S and should if possible be displayed as a US flag (otherwise falls back to text "US"), should reversing it create 🇸🇺 (replacing the flag with the characters "SU"), or still show the flag? (I'm not even sure if there isn't a case where both are valid country codes and it would change to a different flag?)

Similarly, Emoji can be formed from a sequence with combining characters inbetween, which don't display correctly if reversed codepoint by codepoint.

ryandrake|9 years ago

Some examples: If you're dealing with UTF-8, which is very common, you need to handle variable-length characters. If you're working with UTF-16 you need to handle surrogate pairs. Neither are the end of the world, but the basic "array walking" string reversal methods you'd expect from a white boarding session wouldn't work.