top | item 45225493

(no title)

Having the continuation bytes always start with the bits `10` also make it possible to seek to any random byte, and trivially know if you're at the beginning of a character or at a continuation byte like you mentioned, so you can easily find the beginning of the next or previous character.

If the characters were instead encoded like EBML's variable size integers[1] (but inverting 1 and 0 to keep ASCII compatibility for the single-byte case), and you do a random seek, it wouldn't be as easy (or maybe not even possible) to know if you landed on the beginning of a character or in one of the `xxxx xxxx` bytes.

[1]: https://www.rfc-editor.org/rfc/rfc8794#section-4.4

discuss

Animats|5 months ago

Right. That's one of the great features of UTF-8. You can move forwards and backwards through a UTF-8 string without having to start from the beginning.

Python has had troubles in this area. Because Python strings are indexable by character, CPython used wide characters. At one point you could pick 2-byte or 4-byte characters when building CPython. Then that switch was made automatic at run time. But it's still wide characters, not UTF-8. One emoji and your string size quadruples.

I would have been tempted to use UTF-8 internally. Indices into a string would be an opaque index type which behaved like an integer to the extent that you could add or subtract small integers, and that would move you through the string. If you actually converted the opaque type to a real integer, or tried to subscript the string directly, an index to the string would be generated. That's an unusual case. All the standard operations, including regular expressions, can work on a UTF-8 representation with opaque index objects.

nostrademons|5 months ago

PyCompactUnicodeObject was introduced with Python 3.3, and uses UTF-8 internally. It's used whenever both size and max code point are known, which is most cases where it comes from a literal or bytes.decode() call. Cut memory usage in typical Django applications by 2/3 when it was implemented.

https://peps.python.org/pep-0393/

I would probably use UTF-8 and just give up on O(1) string indexing if I were implementing a new string type. It's very rare to require arbitrary large-number indexing into strings. Most use-cases involve chopping off a small prefix (eg. "hex_digits[2:]") or suffix (eg. "filename[-3:]"), and you can easily just linear search these with minimal CPU penalty. Or they're part of library methods where you want to have your own custom traversals, eg. .find(substr) can just do Boyer-Moore over bytes, .split(delim) probably wants to do a first pass that identifies delimiter positions and then use that to allocate all the results at once.

btown|5 months ago

This is Python; finding new ways to subscript into things directly is a graduate student’s favorite pastime!

In all seriousness I think that encoding-independent constant-time substring extraction has been meaningful in letting researchers outside the U.S. prototype, especially in NLP, without worrying about their abstractions around “a 5 character subslice” being more complicated than that. Memory is a tradeoff, but a reasonably predictable one.

kccqzy|5 months ago

Indices into a Unicode string is a highly unusual operation that is rarely needed. A string is Unicode because it is provided by the user or a localized user-facing string. You don't generally need indices.

Programmer strings (aka byte strings) do need indexing operations. But such strings usually do not need Unicode.

johncolanduoni|5 months ago

Your solution is basically what Swift does. Plus they do the same with extended grapheme clusters (what a human would consider distinct characters mostly), and that’s the default character type instead of Unicode code point. Easily the best Unicode string support of any programming language.

cryptonector|5 months ago

Variable width encodings like UTF-8 and UTF-16 cannot be indexed in O(1), only in O(N). But this is not really a problem! Instead of indexing strings we need to slice them, and generally we read them forwards, so if slices (and slices of slices) are cheap, then you can parse textual data without a problem. Basically just keep the indices small and there's no problem.

zahlman|5 months ago

> If you actually converted the opaque type to a real integer, or tried to subscript the string directly, an index to the string would be generated.

What conversion rule do you want to use, though? You either reject some values outright, bump those up or down, or else start with a character index that requires an O(N) translation to a byte index.

otabdeveloper4|5 months ago

"Unicode" aka "wide characters" is the dumbest engineering debacle of the century.

> ascii and codepage encodings are legacy, let's standardize on another forwards-incompatible standard that will be obsolete in five years > oh, and we also need to upgrade all our infrastructure for this obsolete-by-design standard because we're now keeping it forever

sparkie|5 months ago

VLQ/LEB128 are a bit better than the EBML's variable size integers. You test the MSB in the byte - `0` means it's the end of a sequence and the next byte is a new sequence. If the MSB is `1`, to find the start of the sequence you walk back until you find the first zero MSB at the end of the previous sequence (or the start of the stream). There are efficient SIMD-optimized implementations of this.

The difference between VLQ and LEB128 is endianness, basically whether the zero MSB is the start or end of a sequence.

    0xxxxxxx                   - ASCII
    1xxxxxxx 0xxxxxxx          - U+0080 .. U+3FFF
    1xxxxxxx 1xxxxxxx 0xxxxxxx - U+4000 .. U+10FFFD

                      0xxxxxxx - ASCII
             0xxxxxxx 1xxxxxxx - U+0080 .. U+3FFF
    0xxxxxxx 1xxxxxxx 1xxxxxxx - U+4000 .. U+10FFFD

It's not self-synchronizing like UTF-8, but it's more compact - any unicode codepoint can fit into 3 bytes (which can encode up to 0x1FFFFF), and ASCII characters remain 1 byte. Can grow to arbitrary sizes. It has a fixed overhead of 1/8, whereas UTF-8 only has overhead of 1/8 for ASCII and 1/3 thereafter. Could be useful compressing the size of code that uses non-ASCII, since most of the mathematical symbols/arrows are < U+3FFF. Also languages like Japanese, since Katakana and Hiragana are also < U+3FFF, and could be encoded in 2 bytes rather than 3.

kstenerud|5 months ago

Unfortunately, VLQ/LEB128 is slow to process due to all the rolling decision points (one decision point per byte, with no ability to branch predict reliably). It's why I used a right-to-left unary code in my stuff: https://github.com/kstenerud/bonjson/blob/main/bonjson.md#le...

  | Header     | Total Bytes | Payload Bits |
  | ---------- | ----------- | ------------ |
  | `.......1` |      1      |       7      |
  | `......10` |      2      |      14      |
  | `.....100` |      3      |      21      |
  | `....1000` |      4      |      28      |
  | `...10000` |      5      |      35      |
  | `..100000` |      6      |      42      |
  | `.1000000` |      7      |      49      |
  | `10000000` |      8      |      56      |
  | `00000000` |      9      |      64      |

The full value is stored little endian, so you simply read the first byte (low byte) in the stream to get the full length, and it has the exact same compactness of VLQ/LEB128 (7 bits per byte).

Even better: modern chips have instructions that decode this field in one shot (callable via builtin):

https://github.com/kstenerud/ksbonjson/blob/main/library/src...

    static inline size_t decodeLengthFieldTotalByteCount(uint8_t header) {
        return (size_t)__builtin_ctz(header) + 1;
    }

After running this builtin, you simply re-read the memory location for the specified number of bytes, then cast to a little-endian integer, then shift right by the same number of bits to get the final payload - with a special case for `00000000`, although numbers that big are rare. In fact, if you limit yourself to max 56 bit numbers, the algorithm becomes entirely branchless (even if your chip doesn't have the builtin).

https://github.com/kstenerud/ksbonjson/blob/main/library/src...

It's one of the things I did to make BONJSON 35x faster to decode/encode compared to JSON.

https://github.com/kstenerud/bonjson

If you wanted to maintain ASCII compatibility, you could use a 0-based unary code going left-to-right, but you lose a number of the speed benefits of a little endian friendly encoding (as well as the self-synchronization of UTF-8 - which admittedly isn't so important in the modern world of everything being out-of-band enveloped and error-corrected). But it would still be a LOT faster than VLQ/LEB128.

deepsun|5 months ago

That's assuming the text is not corrupted or maliciously modified. There were (are) _numerous_ vulnerabilities due to parsing/escaping of invalid UTF-8 sequences.

Quick googling (not all of them are on-topic tho):

https://www.rapid7.com/blog/post/2025/02/13/cve-2025-1094-po...

https://www.cve.org/CVERecord/SearchResults?query=utf-8

restalis|5 months ago

This tendency of requirement overloading, for what can otherwise be a simple solution for a simple problem, is the bane of engineering. In this case, if security is important, it can be addressed separately, e.g. for the underlying text treated as an abstract information block that has to be packaged with corresponding error codes then checked for integrity before consumption. The UTF-8 encoding/decoding process itself doesn't necessarily have to answer the security concerns. Please let the solutions be simple, whenever they can be.

s1mplicissimus|5 months ago

I was just wondering a similar thing: If 10 implies start of character, doesn't that require 10 to never occur inside the other bits of a character?

PaulHoule|5 months ago

It's not uncommon when you want variable length encodings to write the number of extension bytes used in unary encoding

https://en.wikipedia.org/wiki/Unary_numeral_system

and also use whatever bits are left over encoding the length (which could be in 8 bit blocks so you write 1111/1111 10xx/xxxx to code 8 extension bytes) to encode the number. This is covered in this CS classic

https://archive.org/details/managinggigabyte0000witt

together with other methods that let you compress a text + a full text index for the text into less room than text and not even have to use a stopword list. As you say, UTF-8 does something similar in spirit but ASCII compatible and capable of fast synchronization if data is corrupted or truncated.

cryptonector|5 months ago

This is referred to as UTF-8 being "self-synchronizing". You can jump to the middle and find a codepoint boundary. You can read it backwards. You can read it forwards.

procaryote|5 months ago

also, the redundancy means that you get a pretty good heuristic for "is this utf-8". Random data or other encodings are pretty unlikely to also be valid utf-8, at least for non-tiny strings

jridgewell|5 months ago

This isn't quite right. In invalid UTF8, a continuation byte can also emit a replacement char if it's the start of the byte sequence. Eg, `0b01100001 0b10000000 0b01100001` outputs 3 chars: a�a. Whether you're at the beginning of an output char depends on the last 1-3 bytes.

rockwotj|5 months ago

> outputs 3 chars

You mean codepoints or maybe grapheme clusters?

Anyways yeah it’s a little more complicated but the principle of being able to truncate a string without splitting a codepoint in O(1) is still useful

spankalee|5 months ago

Wouldn't you only need to read backwards at most 3 bytes to see if you were currently at a continuation byte? With a max multi-byte size of 4 bytes, if you don't see a multi-byte start character by then you would know it's a single-byte char.

I wonder if a reason is similar though: error recovery when working with libraries that aren't UTF-8 aware. If you slice naively slice an array of UTF-8 bytes, a UTf-8 aware library can ignore malformed leading and trailing bytes and get some reasonable string out of it.

Sharlin|5 months ago

It’s not always possible to read backwards.

jancsika|5 months ago

> Having the continuation bytes always start with the bits `10` also make it possible to seek to any random byte, and trivially know if you're at the beginning of a character or at a continuation byte like you mentioned, so you can easily find the beginning of the next or previous character.

Given four byte maximum, it's a similarly trivial algo for the other case you mention.

The main difference I see is that UTF8 increases the chance of catching and flagging an error in the stream. E.g., any non-ASCII byte that is missing from the stream is highly likely to cause an invalid sequence. Whereas with the other case you mention the continuation bytes would cause silent errors (since an ASCII character would be indecipherable from continuation bytes).

Encoding gurus-- am I right?

KingLancelot|5 months ago

[deleted]

thesz|5 months ago

> so you can easily find the beginning of the next or previous character.

It is not true [1]. While it is not UTF-8 problem per se, it is a problem of how UTF-8 is being used.

[1] https://paulbutler.org/2025/smuggling-arbitrary-data-through...

layer8|5 months ago

Parent means “character” as defined here in Unicode: https://www.unicode.org/versions/Unicode17.0.0/core-spec/cha..., effectively code points. Meanings 2 and 3 in the Unicode glossary here: https://www.unicode.org/glossary/#character

1oooqooq|5 months ago

so you replace one costly sweeping with a costly sweeping. i wouldn't call that an advantage in any way over junping n bytes.

what you describe is the bare minimum so you even know what you are searching for while you scan pretty much everything every time.

hk__2|5 months ago

What do you mean? What would you suggest instead? Fixed-length encoding? It would take a looot of space given all the character variations you can have.