(no title)
zigzigzag | 8 years ago
UTF-8 is a fine transport format, but for raw runtime performance it's obviously going to be an issue if you ever need to iterate over characters, do substring matches, and the like, because you can't do constant-time "next char" or indexing.
UTF-16 doesn't let you do that either in the presence of combining characters, but they're pretty rare and for many operations it doesn't really matter.
burntsushi | 8 years ago
UTF-8 is certainly not a problem for runtime performance. Substring search, for example, is as straightforward as you might imagine. You have a needle in UTF-8 and a haystack in UTF-8, and a straightforward application of `memmem` will work just fine (for example). In fact, UTF-8 works out great for performance, because it's very simple to apply fast routines like `memchr`. e.g., if you `memchr` for `a`, then because of UTF-8's self-synchronizing property, any and all matches for `a` actually correspond to the codepoint U+0061.
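A small sketch of the self-synchronizing property in Rust (illustrative strings and the helper name `ascii_positions` are mine): scanning the raw bytes for an ASCII byte finds exactly the same positions as a codepoint-aware scan, because an ASCII byte can never occur inside a multibyte UTF-8 sequence.

```rust
/// Byte offsets of every occurrence of the ASCII byte `needle` in `haystack`.
/// Because UTF-8 is self-synchronizing, an ASCII byte never appears inside a
/// multibyte sequence, so each byte-level hit is a genuine codepoint match.
fn ascii_positions(haystack: &str, needle: u8) -> Vec<usize> {
    haystack
        .bytes()
        .enumerate()
        .filter(|&(_, b)| b == needle)
        .map(|(i, _)| i)
        .collect()
}

fn main() {
    let hay = "café apple naïve a"; // `é` and `ï` are two bytes each
    let byte_hits = ascii_positions(hay, b'a');

    // Cross-check against a codepoint-aware scan: same byte offsets.
    let char_hits: Vec<usize> = hay
        .char_indices()
        .filter(|&(_, c)| c == 'a')
        .map(|(i, _)| i)
        .collect();
    assert_eq!(byte_hits, char_hits);
    println!("{:?}", byte_hits); // [1, 6, 13, 19]
}
```

A real `memchr` does the same scan with SIMD, but the correctness argument is identical.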
Indexing works fine so long as your indices are byte offsets at valid UTF-8 boundaries. Byte offset indexing tends to be useful for mechanical transformations on a string. For example, if you know your `substring` starts at position `i` in `mystr`, then `&mystr[i + substring.len()..]` gives you the slice of `mystr` immediately following your substring in constant time. When all your APIs deal in byte offsets, this turns out to be a perfectly natural thing to do.
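For instance, the byte-offset slicing above looks like this in practice (the helper name `after_match` is my own; `str::find` returns a byte offset at a valid UTF-8 boundary):

```rust
/// Slice of `haystack` immediately following the first occurrence of
/// `needle`, computed in constant time from byte offsets alone.
fn after_match<'a>(haystack: &'a str, needle: &str) -> Option<&'a str> {
    haystack
        .find(needle) // byte offset of the match, at a UTF-8 boundary
        .map(|i| &haystack[i + needle.len()..]) // O(1) slice, no re-scan
}

fn main() {
    // The multibyte `→` (3 bytes) shifts the byte offsets, but the
    // arithmetic still works because everything is in bytes.
    let s = "prefix → needle → suffix";
    assert_eq!(after_match(s, "needle"), Some(" → suffix"));
    println!("{:?}", after_match(s, "needle"));
}
```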
Generally speaking, indexing by Unicode codepoint isn't an operation you want to do, because it tends to betray the problem you're trying to solve. For example, if you wanted to display a trimmed string to an end user by "selecting the first 9 characters," then selecting the first 9 codepoints would result in bad things in some circumstances, and it's not just limited to the presence of combining characters. For example, UTF-16 encodes codepoints outside the basic multilingual plane using surrogate pairs, where a surrogate pair consists of two surrogate codepoints that combine to form a single Unicode scalar value (i.e., a non-surrogate codepoint). So if you do the "obvious" thing with UTF-16, you'll wind up with bad results in not-exactly-corner cases.
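The surrogate-pair hazard is easy to demonstrate from Rust (the example string is mine): a codepoint outside the BMP takes two UTF-16 code units, so truncating at a fixed number of units can land in the middle of a character.

```rust
fn main() {
    // "hello 😀!" — the emoji is U+1F600, outside the basic multilingual
    // plane, so UTF-16 encodes it as a surrogate pair (two 16-bit units).
    let s = "hello 😀!";
    let units: Vec<u16> = s.encode_utf16().collect();
    assert_eq!(units.len(), 9);       // 9 UTF-16 code units...
    assert_eq!(s.chars().count(), 8); // ...but only 8 codepoints.

    // Naively keeping the "first 7 UTF-16 units" splits the pair,
    // leaving a lone surrogate that no longer decodes to a valid string.
    assert!(String::from_utf16(&units[..7]).is_err());
}
```

UTF-8 has the mirror-image problem with byte truncation, which is why Rust panics on slicing at a non-boundary rather than silently producing garbage.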
It's worth noting that Rust isn't alone in this. Go represents strings similarly and it also works remarkably well. (The only difference between Go and Rust is that Rust's string type is guaranteed to contain valid UTF-8, whereas Go's string type is conventionally UTF-8.) Notably, you won't find "character indexing" anywhere in Go's standard library or various Unicode support libraries. :-)
I would very strongly urge you to read my link in my previous comment to you. I think it would help clarify a lot of misconceptions.
kccqzy | 8 years ago
A better suggestion is to rethink why you need those operations in the first place.