top | item 14516743

(no title)

zigzigzag | 8 years ago

That seems like a really unfortunate design decision. I used to think that Java's use of UTF-16 for strings was just a problematic legacy thing, but compared to this it seems quite good. Strings are pretty high performance and there are no complex calculations to do indexing or bounds checks. And in Java 9 the JVM can switch between UTF-16 or Latin1 encodings on the fly, which both uses less RAM and speeds things up simultaneously. There are no memory safety issues caused by character encodings.

discuss

burntsushi|8 years ago

I think there might be a few misconceptions in your comment, but it's hard to say for sure. This article isn't written in a style that I particularly enjoy, but it's not too long and contains some useful facts that might help dispel those misconceptions: http://utf8everywhere.org

I do have my unrelated niche complaints about Rust's string story (and I have vague plans to resolve them), but Rust's string implementation is my favorite among any other language I've used.

kibwen|8 years ago

What seems like an unfortunate design decision? I'm unclear what concerns you're seeing that would suggest UTF-16 as preferable to UTF-8, either in terms of performance or memory safety.

zigzigzag|8 years ago

To not only use UTF-8 as the internal string encoding but practically mandate it, if you want to remain safe.

UTF-8 is a fine transport format, but for raw runtime performance it's obviously going to be an issue if you ever need to iterate over characters, do substring matches, things like that because you can't do constant time "next char" or indexing.

UTF-16 doesn't let you do that either in the presence of combining characters, but they're pretty rare and for many operations it doesn't really matter.