It's an encoding that isn't good at anything: it is neither ASCII-compatible (like UTF-8) nor fixed-length (like UTF-32). But because most characters require only 2 bytes, developers frequently assume that none require more, leading to bugs when a character outside the Basic Multilingual Plane eventually shows up and is represented by 4 bytes (a surrogate pair).
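The comment above can be made concrete with a short Rust sketch (Rust's std exposes UTF-16 encoding directly, so no assumptions beyond the standard library):

```rust
fn main() {
    // 'é' (U+00E9) fits in one UTF-16 code unit, but '𝄞' (U+1D11E,
    // MUSICAL SYMBOL G CLEF) lies outside the BMP and needs a
    // surrogate pair, i.e. two code units (4 bytes).
    assert_eq!('é'.encode_utf16(&mut [0u16; 2]).len(), 1);
    assert_eq!('𝄞'.encode_utf16(&mut [0u16; 2]).len(), 2);

    // So counting UTF-16 code units over-counts code points:
    let s = "𝄞 clef";
    assert_eq!(s.chars().count(), 6);        // 6 code points
    assert_eq!(s.encode_utf16().count(), 7); // 7 UTF-16 code units
    println!("ok");
}
```

Code that assumes one code unit per character silently breaks on the second assertion's kind of input.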
asveikau|10 years ago
UTF-32 is only fixed-length if you don't care about diacritics, variation selectors, RTL languages, and so on. Unicode is not one code point, or one char/wchar/uint32, per glyph.
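A minimal Rust illustration of that point, using only std (combining characters and regional indicators are enough to show one glyph spanning multiple code points):

```rust
fn main() {
    // "é" written as 'e' followed by U+0301 COMBINING ACUTE ACCENT:
    // one user-perceived character, two code points -- so even UTF-32
    // is not "one unit per glyph".
    let decomposed = "e\u{301}";
    assert_eq!(decomposed.chars().count(), 2);

    // A national-flag emoji is two regional-indicator code points:
    let flag = "\u{1F1EB}\u{1F1F7}"; // 🇫🇷
    assert_eq!(flag.chars().count(), 2);
    println!("ok");
}
```

Grouping such sequences into user-perceived characters (grapheme clusters) needs segmentation logic beyond any fixed-width encoding.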
gilgoomesh|10 years ago
Few string libraries actually deal with grapheme clusters as the native underlying representation (Swift being a notable exception).
tatterdemalion|10 years ago
The Rust std library had to pick a string encoding, and it picked UTF-8 (which is really the best Unicode encoding). The String type is platform-neutral and always UTF-8.
However, it does provide an OsString type, which on Windows represents the platform's native (potentially ill-formed) UTF-16 strings. Maybe there is a library, and if not, one could be written, targeting Windows only and implementing stronger UTF-16 string processing on the OsString type.
EDIT: To be clear, Rust's trait system makes this very easy to do. You just define all the methods you want OsString to have in a trait WindowsString, and implement it for OsString, even though OsString is a std library type. One of the great things about Rust is that it's trivial to use the std library as a shared "pivot" which various third-party libraries extend according to your use case.
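A sketch of the extension-trait pattern described above. The trait name `WindowsString` and its method are illustrative, not a real library's API; a real Windows-only crate would likely build on `std::os::windows::ffi::OsStrExt` instead of the portable approximation used here:

```rust
use std::ffi::OsString;

// Hypothetical extension trait adding UTF-16-oriented methods to a
// std type we don't own (OsString).
trait WindowsString {
    /// Length of the string in UTF-16 code units.
    /// Portable sketch: goes through a lossy UTF-8 conversion, which a
    /// Windows-only implementation would not need.
    fn utf16_len(&self) -> usize;
}

impl WindowsString for OsString {
    fn utf16_len(&self) -> usize {
        self.to_string_lossy().encode_utf16().count()
    }
}

fn main() {
    let s = OsString::from("héllo");
    assert_eq!(s.utf16_len(), 5); // all BMP chars: one code unit each
    println!("ok");
}
```

The key point is that the `impl` block lives in downstream code, so any crate can layer its own methods onto std's `OsString` without forking it.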