top | item 30261508

nephrite | 4 years ago

From the Wikipedia article:

Prefix code: The first byte indicates the number of bytes in the sequence. Reading from a stream can instantaneously decode each individual fully received sequence, without first having to wait for either the first byte of a next sequence or an end-of-stream indication. The length of multi-byte sequences is easily determined by humans as it is simply the number of high-order 1s in the leading byte. An incorrect character will not be decoded if a stream ends mid-sequence.

https://en.wikipedia.org/wiki/UTF-8#Comparison_with_other_en...
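To make the streaming claim in the quote concrete, here is a minimal Python sketch, not taken from the article. The name `stream_chars` is made up for illustration. It reads one byte at a time and emits a character on the very byte that completes it, without peeking at the next byte; a sequence cut off by the end of the stream is simply never emitted. Validation of continuation bytes is left to Python's built-in decoder.

```python
from typing import Iterable, Iterator


def stream_chars(byte_stream: Iterable[int]) -> Iterator[str]:
    """Yield each character as soon as its last byte arrives (illustrative sketch)."""
    buf = bytearray()
    need = 0  # total bytes expected for the sequence currently being read
    for b in byte_stream:
        if need == 0:
            # The leading byte alone tells us how long the sequence is.
            if b < 0x80:
                need = 1
            elif b >> 5 == 0b110:
                need = 2
            elif b >> 4 == 0b1110:
                need = 3
            elif b >> 3 == 0b11110:
                need = 4
            else:
                raise ValueError(f"invalid leading byte 0x{b:02X}")
        buf.append(b)
        if len(buf) == need:
            # Complete: decode now, no lookahead needed.
            yield buf.decode("utf-8")
            buf.clear()
            need = 0
    # Anything left in buf here is a truncated sequence and is dropped.
```

For example, feeding it the bytes of "a€" with the final byte missing yields only "a".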

pierrebai|4 years ago

"Instantaneously" only in the sense that you first have to read the first byte to know how many bytes to read, so it's a two-step process. Given the current maximum length and SIMD, detecting the end byte of my scheme is easily parallelizable for up to 4 bytes, which conveniently gives 24 bits, enough for all current Unicode code points, so there is no waiting for termination. Furthermore, decoding a UTF-8 character requires bit extraction and shifting on every byte, so there is no practical gain from not looking at every byte. It actually makes the decoding loop more complex.

Also, the human readability claim sounds fishy. Humans are really bad at decoding high-order bits. For example, can you tell at a glance the length of a UTF-8 sequence that begins with 0xEC? With my scheme, either the high bit is not set (0x7F or less), which is easy to see: you only need to compare the first digit to 7. Or the high bit is set and the high nibble is less than 0xC, meaning there is another byte, which is also easy to see: you compare the first digit to C.
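For reference, the UTF-8 rule being discussed (count the high-order 1 bits of the leading byte) can be sketched in a few lines of Python. The function name `utf8_sequence_length` is made up for this example, and the sketch does not reject every invalid leading byte (0xC0, 0xC1 and 0xF5 and above pass the bit count but are not valid UTF-8).

```python
def utf8_sequence_length(lead: int) -> int:
    """Return the total byte length of a UTF-8 sequence from its leading byte."""
    if not 0 <= lead <= 0xFF:
        raise ValueError("not a byte")
    # Count the high-order 1 bits of the leading byte.
    ones = 0
    mask = 0x80
    while mask and (lead & mask):
        ones += 1
        mask >>= 1
    if ones == 0:
        return 1  # 0xxxxxxx: single-byte (ASCII) sequence
    if ones == 1:
        raise ValueError("continuation byte, not a leading byte")
    if ones > 4:
        raise ValueError("not a leading byte in current UTF-8")
    return ones
```

0xEC is 1110 1100 in binary, so it has three leading 1s and starts a 3-byte sequence.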

The quote also implicitly mischaracterizes my scheme: an incorrect character would not be decoded there either if the stream were interrupted, since the sequence would lack the terminating flag (No byte > 0xC0).
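On the earlier point that decoding UTF-8 involves bit extraction and shifting on every byte, here is a minimal Python sketch of decoding one complete sequence. The name `decode_utf8_char` is hypothetical, and the sketch skips checks for overlong encodings and surrogates, so treat it as an illustration of the bit manipulation rather than a conforming decoder.

```python
def decode_utf8_char(seq: bytes) -> int:
    """Decode one complete UTF-8 sequence into a code point (illustrative only)."""
    lead = seq[0]
    # Extract the payload bits of the leading byte and the expected length.
    if lead < 0x80:
        n, cp = 1, lead
    elif lead >> 5 == 0b110:
        n, cp = 2, lead & 0x1F
    elif lead >> 4 == 0b1110:
        n, cp = 3, lead & 0x0F
    elif lead >> 3 == 0b11110:
        n, cp = 4, lead & 0x07
    else:
        raise ValueError("invalid leading byte")
    if len(seq) != n:
        raise ValueError("wrong sequence length")
    # Each continuation byte contributes its low 6 bits.
    for b in seq[1:]:
        if b >> 6 != 0b10:
            raise ValueError("invalid continuation byte")
        cp = (cp << 6) | (b & 0x3F)
    return cp
```

Every byte is masked and shifted, but the length is known after the first byte, so the loop bound is fixed before the continuation bytes are examined.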

account42|4 years ago

"Instantly" as in you get a stream of characters and with UTF-8 you always know as soon as you have received a full character. With your encoding it is always possible that you have not received the full character yet and need to wait until the start of the next character (or a timeout).