It's an encoding that isn't good at anything: it is neither ASCII-compatible (like UTF-8) nor fixed-length (like UTF-32). But because most characters require only 2 bytes, developers frequently assume that none require more, leading to bugs when a character outside the Basic Multilingual Plane eventually shows up and is represented by 4 bytes (a surrogate pair).
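The comment above can be made concrete with a short Rust sketch (Rust's std exposes UTF-16 encoding directly, so no assumptions beyond the standard library):

```rust
fn main() {
    // 'é' (U+00E9) fits in one UTF-16 code unit, but '𝄞' (U+1D11E,
    // MUSICAL SYMBOL G CLEF) lies outside the BMP and needs a
    // surrogate pair, i.e. two code units (4 bytes).
    assert_eq!('é'.encode_utf16(&mut [0u16; 2]).len(), 1);
    assert_eq!('𝄞'.encode_utf16(&mut [0u16; 2]).len(), 2);

    // So counting UTF-16 code units over-counts code points:
    let s = "𝄞 clef";
    assert_eq!(s.chars().count(), 6);        // 6 code points
    assert_eq!(s.encode_utf16().count(), 7); // 7 UTF-16 code units
    println!("ok");
}
```

Code that assumes one code unit per character silently breaks on the second assertion's kind of input.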
asveikau|10 years ago
UTF-32 is only fixed-length if you don't care about diacritics, variation selectors, RTL languages, and so on. Unicode is not one code point, or one char/wchar/uint32, per glyph.
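A minimal Rust illustration of that point, using only std (combining characters and regional indicators are enough to show one glyph spanning multiple code points):

```rust
fn main() {
    // "é" written as 'e' followed by U+0301 COMBINING ACUTE ACCENT:
    // one user-perceived character, two code points -- so even UTF-32
    // is not "one unit per glyph".
    let decomposed = "e\u{301}";
    assert_eq!(decomposed.chars().count(), 2);

    // A national-flag emoji is two regional-indicator code points:
    let flag = "\u{1F1EB}\u{1F1F7}"; // 🇫🇷
    assert_eq!(flag.chars().count(), 2);
    println!("ok");
}
```

Grouping such sequences into user-perceived characters (grapheme clusters) needs segmentation logic beyond any fixed-width encoding.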
gilgoomesh|10 years ago
Few string libraries actually deal with grapheme clusters as the native underlying representation (Swift being a notable exception).
tatterdemalion|10 years ago
The Rust std library had to pick a string encoding, and it picked UTF-8 (which is really the best Unicode encoding). The String type is platform-neutral and always UTF-8.
However, it does provide an OsString type, which on Windows represents the platform's native (potentially ill-formed) UTF-16 strings. Maybe there is a library, and if not, one could be written, targeting Windows only and implementing stronger UTF-16 string processing on the OsString type.
EDIT: To be clear, Rust's trait system makes this very easy to do. You just define all the methods you want OsString to have in a trait WindowsString, and implement it for OsString, even though OsString is a std library type. One of the great things about Rust is that it's trivial to use the std library as a shared "pivot" which various third-party libraries extend according to your use case.
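A sketch of the extension-trait pattern described above. The trait name `WindowsString` and its method are illustrative, not a real library's API; a real Windows-only crate would likely build on `std::os::windows::ffi::OsStrExt` instead of the portable approximation used here:

```rust
use std::ffi::OsString;

// Hypothetical extension trait adding UTF-16-oriented methods to a
// std type we don't own (OsString).
trait WindowsString {
    /// Length of the string in UTF-16 code units.
    /// Portable sketch: goes through a lossy UTF-8 conversion, which a
    /// Windows-only implementation would not need.
    fn utf16_len(&self) -> usize;
}

impl WindowsString for OsString {
    fn utf16_len(&self) -> usize {
        self.to_string_lossy().encode_utf16().count()
    }
}

fn main() {
    let s = OsString::from("héllo");
    assert_eq!(s.utf16_len(), 5); // all BMP chars: one code unit each
    println!("ok");
}
```

The key point is that the `impl` block lives in downstream code, so any crate can layer its own methods onto std's `OsString` without forking it.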