top | item 14517589

zigzigzag | 8 years ago

To not only use UTF-8 as the internal string encoding but practically mandate it, if you want to remain safe.

UTF-8 is a fine transport format, but for raw runtime performance it's obviously going to be an issue if you ever need to iterate over characters, do substring matches, and the like, because you can't do constant-time "next char" lookup or indexing.

UTF-16 doesn't let you do that either in the presence of combining characters, but they're pretty rare and for many operations it doesn't really matter.
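To make that concrete, here's a minimal std-only Rust sketch showing that a user-perceived character written with a combining mark spans multiple codepoints, and multiple UTF-16 code units as well:

```rust
fn main() {
    // "é" written as 'e' followed by the combining acute accent U+0301.
    let s = "e\u{0301}";
    // Two Unicode codepoints make up this one user-perceived character...
    assert_eq!(s.chars().count(), 2);
    // ...and two UTF-16 code units, so indexing UTF-16 by code unit
    // doesn't land on "character" boundaries either.
    assert_eq!(s.encode_utf16().count(), 2);
}
```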

burntsushi | 8 years ago

I feel like your comment contains a lot of misunderstanding about UTF-8. For example, UTF-8 is self-synchronizing, which means you can indeed find the "next char" in constant time.
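For what it's worth, here's a minimal std-only Rust sketch of that property (the helper name `utf8_len` is mine, not a standard API): the leading byte alone tells you how many bytes the current codepoint occupies, so advancing from any valid boundary is O(1).

```rust
// The leading byte of a UTF-8 sequence encodes the sequence length, so
// stepping to the next codepoint from a valid boundary is constant time.
fn utf8_len(first_byte: u8) -> usize {
    match first_byte {
        0x00..=0x7F => 1, // ASCII
        0xC0..=0xDF => 2, // 2-byte sequence
        0xE0..=0xEF => 3, // 3-byte sequence
        _ => 4,           // 4-byte sequence (assumes a valid boundary)
    }
}

fn main() {
    let s = "aé漢🦀"; // 1-, 2-, 3-, and 4-byte codepoints
    let mut i = 0;
    let mut lens = Vec::new();
    while i < s.len() {
        let step = utf8_len(s.as_bytes()[i]);
        lens.push(step);
        i += step;
    }
    assert_eq!(lens, vec![1, 2, 3, 4]);
}
```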

UTF-8 is certainly not a problem for runtime performance. Substring search, for example, is as straightforward as you might imagine. You have a needle in UTF-8 and a haystack in UTF-8, and a straightforward application of `memmem` will work just fine (for example). In fact, UTF-8 works out great for performance, because it's very simple to apply fast routines like `memchr`. e.g., if you `memchr` for `a`, then because of UTF-8's self-synchronizing property, any and all matches for `a` actually correspond to the codepoint U+0061.
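A small std-only Rust sketch of why that works (a plain byte scan stands in for `memchr` here): continuation bytes always look like `10xxxxxx`, so a byte equal to an ASCII value can only be that ASCII codepoint.

```rust
fn main() {
    // Continuation bytes are always 0x80..=0xBF, so an ASCII byte found
    // anywhere in valid UTF-8 can only be that codepoint itself.
    let haystack = "naïve café arrá";
    let positions: Vec<usize> = haystack
        .bytes()
        .enumerate()
        .filter(|&(_, b)| b == b'a')
        .map(|(i, _)| i)
        .collect();
    // Every hit is a real 'a' (U+0061), never the middle of a
    // multi-byte character.
    for &i in &positions {
        assert!(haystack[i..].starts_with('a'));
    }
    // Plain byte-oriented substring search also just works on UTF-8:
    assert_eq!(haystack.find("café"), Some(7));
}
```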

Indexing works fine so long as your indices are byte offsets at valid UTF-8 boundaries. Byte offset indexing tends to be useful for mechanical transformations on a string. For example, if you know your `substring` starts at position `i` in `mystr`, then `&mystr[i + substring.len()..]` gives you the slice of `mystr` immediately following your substring in constant time. When all your APIs deal in byte offsets, this turns out to be a perfectly natural thing to do.
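Concretely, the same pattern in Rust (hypothetical strings, but the slicing is exactly as described):

```rust
fn main() {
    let mystr = "héllo, wörld";
    let substring = "héllo";
    // `find` returns a byte offset guaranteed to sit on a valid
    // UTF-8 boundary.
    let i = mystr.find(substring).unwrap();
    // Constant-time slice of everything after the match.
    let rest = &mystr[i + substring.len()..];
    assert_eq!(rest, ", wörld");
}
```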

Generally speaking, indexing by Unicode codepoint isn't an operation you want to do, because it tends to betray the problem you're trying to solve. For example, if you wanted to display a trimmed string to an end user by "selecting the first 9 characters," then selecting the first 9 codepoints would result in bad things in some circumstances, and it's not just limited to the presence of combining characters. For example, UTF-16 encodes codepoints outside the basic multilingual plane using surrogate pairs, where a surrogate pair consists of two surrogate codepoints that combine to form a single Unicode scalar value (i.e., a non-surrogate codepoint). So if you do the "obvious" thing with UTF-16, you'll wind up with bad results in not-exactly-corner cases.
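A quick Rust sketch of the "select the first N characters" trap (Rust strings are UTF-8, so the UTF-16 half is shown via `encode_utf16`):

```rust
fn main() {
    // "resumé" with the accent written as a combining mark: 7 codepoints.
    let s = "resume\u{0301}";
    // Naively select "the first 6 characters" by taking 6 codepoints:
    let truncated: String = s.chars().take(6).collect();
    // The combining accent is silently dropped, changing the word.
    assert_eq!(truncated, "resume");
    // And in UTF-16, a single codepoint outside the BMP is a surrogate
    // pair, so truncating by code unit can cut it in half.
    assert_eq!("🦀".encode_utf16().count(), 2);
}
```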

It's worth noting that Rust isn't alone in this. Go represents strings similarly and it also works remarkably well. (The only difference between Go and Rust is that Rust's string type is guaranteed to contain valid UTF-8 whereas Go's string type is conventionally UTF-8.) Notably, you won't find "character indexing" anywhere in Go's standard library or various Unicode support libraries. :-)

I would very strongly urge you to read my link in my previous comment to you. I think it would help clarify a lot of misconceptions.

kccqzy | 8 years ago

This has been repeated so many times, but UTF-16 does not allow constant-time indexing either. Combining characters are one case, and they are not rare at all. What about surrogates? What about grapheme clusters that are a complicated sequence of emoji and emoji modifiers with ZWJ?
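To put numbers on that (std-only Rust; real grapheme segmentation would need something like the unicode-segmentation crate):

```rust
fn main() {
    // The "family" emoji 👨‍👩‍👧 is a single grapheme cluster built from
    // three emoji joined by two zero-width joiners (U+200D): 5 codepoints.
    let family = "\u{1F468}\u{200D}\u{1F469}\u{200D}\u{1F467}";
    assert_eq!(family.chars().count(), 5);
    // In UTF-16 it is even longer: each emoji is a surrogate pair.
    assert_eq!(family.encode_utf16().count(), 8);
    // So no fixed-width view of the data gives constant-time "character"
    // indexing, if "character" means what the user sees on screen.
}
```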

A better suggestion is to rethink why you need those operations in the first place.