TIL the assumption that string length does not change when upper-cased is false

171 points| dredmorbius | 4 years ago |chaos.social | reply

240 comments

[+] arcticbull|4 years ago|reply
The issue is that the concept of “length” without additional context no longer applies to Unicode strings. There isn’t one length; there are four or five, at least.

1. Length in the context of storage or transmission is the number of bytes of its UTF-8 representation, and even that depends on whether the string is stored in a composed form or in one of its decomposed forms.

2. Length in terms of number of visible characters is the number of grapheme clusters.

3. Length in terms of visible space on screen is the number of pixels wide and tall when rendered in a given font.

4. Length when parsing is based on the number of code points.

If you look at it that way it makes perfect sense that the “length” of a string could change when any operation is performed on it - because there’s no such thing as a canonical length.

tl;dr bug or not, the concept of a canonical string length is an anachronism from the pre-Unicode days when all those were the same thing. There’s no such thing as a string length anymore.
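
A minimal Python sketch of how these counts diverge for one visible character, using only the standard library. The variable names are illustrative. The grapheme-cluster count (1 in both forms) is not computed here, because the standard library has no segmentation function for it.

```python
import unicodedata

s = "\u00e9"  # "é" as a single precomposed code point

nfc = unicodedata.normalize("NFC", s)  # composed: U+00E9
nfd = unicodedata.normalize("NFD", s)  # decomposed: U+0065 U+0301

code_points_nfc = len(nfc)                 # code points in composed form
code_points_nfd = len(nfd)                 # code points in decomposed form
utf8_bytes_nfc = len(nfc.encode("utf-8"))  # bytes on the wire, composed
utf8_bytes_nfd = len(nfd.encode("utf-8"))  # bytes on the wire, decomposed

print(code_points_nfc, code_points_nfd, utf8_bytes_nfc, utf8_bytes_nfd)
```

Both forms render as the same single character, yet every numeric "length" above differs between them.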

[+] zionic|4 years ago|reply
Except almost everyone always means #2. No one asked for strings to be ruined in this way, and this kind of pedantry has caused untold frustration from developers who just want their strings to work properly.

If you must expose the underlying byte array do so via a sane access function that returns a typed array.

As for “string length in pixels”, that has absolutely nothing to do with the string itself as that’s determined in the UI layer that ingests the string.

[+] pvg|4 years ago|reply
This particular capitalization would have generated a longer string without Unicode - it's a language convention that a capitalization routine could apply to just about any encoding that has an ß.
[+] quietbritishjim|4 years ago|reply
That is an issue, for sure. But that is not "the issue" that they're describing. They are clearly talking about (2), so confusion about other meanings is irrelevant.

Their point is that they would expect the number of visible grapheme clusters to be the same when converting between cases. But in some languages that's not true, as their example demonstrates. (An upper case ß does exist in Unicode but culturally it's less correct than SS.)
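
For what it's worth, this is how the two forms behave in Python 3, which follows the Unicode case mappings rather than any locale. A small sketch with illustrative variable names:

```python
lower_sharp_s = "\u00df"  # ß, LATIN SMALL LETTER SHARP S
upper_sharp_s = "\u1e9e"  # ẞ, LATIN CAPITAL LETTER SHARP S

# Upper-casing the small letter expands to two letters.
upper_of_lower = lower_sharp_s.upper()

# Lower-casing the capital letter maps back to a single ß.
lower_of_upper = upper_sharp_s.lower()

print(repr(upper_of_lower), repr(lower_of_upper))
```

So the capital ẞ is only produced if it is typed explicitly; the default upper-casing never generates it.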

[+] ASalazarMX|4 years ago|reply
Hey, let's keep inventing lengths.

5. Length in terms of time until complete display when rendered in a given architecture.

6. Length in terms of Braille characters needed to display it.

7. Length in terms of reading time at 300 wpm.

String length is overwhelmingly discussed as character length, meaning #2. Length in bytes should only be an issue for data transmission or storage, and people who still work with ASCII. Rendering dimensions are not relevant to the published article.

[+] vadfa|4 years ago|reply
>3. Length in terms of visible space on screen is the number of pixels wide and tall when rendered in a given font.

Grasping at straws eh?

[+] _rend|4 years ago|reply
> 2. Length in terms of number of visible characters is the number of grapheme clusters.

There's a fun subtlety in this case too: a single grapheme cluster need not draw as a single "visible character" (or glyph) on-screen. The visual representation of a grapheme cluster is entirely dependent on the text system which draws this, and the meaning that system applies to the cluster itself. This is especially true for multi-element emoji clusters, whose recommended meanings[1] change with evolving versions of Unicode.

To add to this, Unicode 12 simplified the definition of grapheme clusters by actually generalizing them so that they can be matched effectively by a regex. (See the "extended grapheme cluster" definition in TR29[2].) This reduced the overall number of special cases and hard-coded lists of combinations in the definitions of grapheme clusters (particularly around emoji), but it also means that there are now infinitely more valid grapheme clusters that don't necessarily map to a single glyph.

One really simple example of this is, e.g.

     ©
(Edit: it appears that HN is actually stripping out the ZWJ text from this example and leaving just the Copyright symbol. See below for how to reproduce this text on your machine.)

That is,

    U+1F434 <horse head> + U+200D <zero width joiner> + U+00A9 <copyright sign>
This cluster trivially matches the definition of

    extended grapheme cluster :=   crlf | Control | precore* _core_ postcore*
    core :=  hangul-syllable | ri-sequence | xpicto-sequence | [^Control CR LF]
    xpicto-sequence :=  \p{Extended_Pictographic} (Extend* ZWJ \p{Extended_Pictographic})*
where

    U+1F434: Extended_Pictographic
    U+200D: ZWJ
    U+00A9: Extended_Pictographic
(I picked this combination somewhat randomly, but ideally, this is an example that should hopefully last as it feels unlikely that "horse copyright" would have a meaningful glyph definition in the future. As of posting this, the above text shows up as two side-by-side glyphs on my machine (macOS Monterey 21A559): a horse, followed by the copyright sign. This may look similar on your machine, or it may not.)

Importantly, you can tell this is actually treated as a real grapheme cluster by the text system on macOS because if you copy that string into a Cocoa text view (e.g., TextEdit), you will only be able to place your cursor on either side of the cluster, but not split it in the middle. A nice interactive way to see this in action is inserting U+1F434 into the document, followed by U+00A9. Then, move your cursor in between those two glyphs and insert U+200D: your cursor should then bounce out from the middle of the newly-formed cluster to the beginning.
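
If you want to build the same sequence yourself, here is a small Python sketch. It only inspects code points and bytes; whether the result counts as one grapheme cluster or draws as one glyph is up to the text system, and the standard library does not expose that.

```python
import unicodedata

# horse + zero width joiner + copyright sign, written with escapes so the
# joiner survives copy and paste
cluster = "\U0001F434\u200D\u00A9"

names = [unicodedata.name(ch) for ch in cluster]  # official character names
code_points = len(cluster)                        # Python counts code points
utf8_bytes = len(cluster.encode("utf-8"))         # storage size in UTF-8

print(cluster, names, code_points, utf8_bytes)
```

Pasting the printed string into a text view is one way to try the cursor experiment described above.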

This was a pretty short example, but this is arbitrarily extensible: (Edit: Originally I had posted U+2705 <check mark symbol> + U+200D + U+1F434 <horse head> + U+200D + U+1F50B <battery> + U+200D + U+1F9F7 <safety pin> [sorry, no staple emoji] but HN stripped that out too. It does appear correctly in the text area while typing, but HN replaces the sequence with spaces after posting.)

As linked above, Unicode does offer a list of sequences like this that are considered to be "meaningful"[1], which you can largely expect vendors which offer emoji representations to respect (and some vendors may offer glyphs for sequences beyond what is suggested here). If you've ever run into this: additions to this list over time explain why transporting a Unicode sequence which appears as a single glyph on one OS can appear as multiple glyphs on an older one (each individual glyph may be supported, but their combination may or may not have a meaning).

In general, if this is interesting to you, you may enjoy trawling through the Unicode emoji data files [3]. You may discover something new!

[1] https://www.unicode.org/Public/emoji/14.0/emoji-zwj-sequence...

[2] https://www.unicode.org/reports/tr29/tr29-35.html#Table_Comb...

[3] https://www.unicode.org/reports/tr51/#emoji_data

[+] varenc|4 years ago|reply
Another fun fact: Upper-casing is language dependent. In English uppercasing 'i' gets you 'I'. But Turkish has a dotted and un-dotted 'i', each with an uppercase variant. So if your user's language was Turkish, uppercasing 'i' would give you 'İ', and lowercasing 'I' would give you 'ı'.

Makes me wonder how case insensitive file systems handle this...and for more fun, handle the situation where the user changes the system language. I know that the Turkish 'I' delayed at least one big company's Turkish localization efforts for a while.
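
As an illustration, Python's `str.upper()` ignores the locale, so Turkish rules have to be applied by hand. The sketch below uses a hypothetical helper, `turkish_upper`, that only covers the dotted and dotless i; it is not a full locale-aware implementation.

```python
# Hypothetical mapping for the two Turkish special cases: i -> İ, ı -> I.
_TR_UPPER = str.maketrans({"i": "\u0130", "\u0131": "I"})


def turkish_upper(text: str) -> str:
    """Upper-case text with Turkish dotted/dotless i rules (sketch only)."""
    return text.translate(_TR_UPPER).upper()


default_upper = "istanbul".upper()   # locale-independent default mapping
turkish = turkish_upper("istanbul")  # starts with U+0130, dotted capital I
lowered = "\u0130".lower()           # default lower-casing of İ

print(default_upper, turkish, len(lowered))
```

Note that lower-casing 'İ' with the default rules yields 'i' followed by a combining dot above, so the code point count changes in that direction too.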

[+] lifthrasiir|4 years ago|reply
> Makes me wonder how case insensitive file systems handle this...

They generally don't. It is true that several case-insensitive file systems (including NTFS, exFAT and ext4 [1]) maintain some sort of configurable case-folding maps but they are mostly used to guard against future Unicode updates and do not vary across locales.

[1] https://dfir.ru/2021/07/15/playing-with-case-insensitive-fil... (ext4 supports an optional case insensitivity)

[+] TimonKnigge|4 years ago|reply
Another example is that in Dutch, the bigram 'ij' is considered a single letter, and so at the beginning of a word, both have to be uppercased. See for example the Dutch Wikipedia page for Iceland: https://nl.wikipedia.org/wiki/IJsland.
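
A short Python sketch of the Dutch rule. `dutch_capitalize` is a hypothetical helper written for this example; it handles only the leading 'ij' digraph and nothing else about Dutch orthography.

```python
def dutch_capitalize(word: str) -> str:
    """Capitalize a word, treating a leading 'ij' as a single letter (sketch)."""
    if word[:2].lower() == "ij":
        return "IJ" + word[2:].lower()
    return word.capitalize()


generic = "ijsland".capitalize()     # generic rule: only the first letter
dutch = dutch_capitalize("ijsland")  # Dutch rule: both letters of the digraph

print(generic, dutch)
```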
[+] AlanYx|4 years ago|reply
>Upper-casing is language dependent.

It goes beyond just language... uppercasing can also be locale-dependent. In Microsoft Word, for example, uppercasing an "é" gets you "É" in the French (Canada) locale but "E" in the French (France) locale.

[+] toast0|4 years ago|reply
My understanding is that case insensitive filesystems that wish to be portable have a mapping table in the metadata. A quick search showed references to an 'upcase table', although I'm not sure of the accuracy of the source, so I won't link it.

Just because the user changed the system language doesn't mean the system should be expected to change the upcase table though. That operation would need to be very carefully managed; you can't change the rules if there are existing files that would conflict in the new mapping. And you might have symbolic links that matched because of case insensitivity that won't anymore... Pretty tricky.

[+] netcraft|4 years ago|reply
I sort of expect that nothing can be assumed when talking about strings and characters anymore. Waiting for the post on HN one day that says that it's in the Unicode spec that characters can animate or be in superposition until observed…
[+] ahefner|4 years ago|reply
I sometimes believe that the full general-purpose embrace of Unicode for text, with no clean distinction between "machine-friendly" text and "for human" natural language text (in every script since the dawn of time plus every goofy emoji anyone dreams up, with all the complexity these entail), is a major mistake that has led computing astray. I fear, though, that it is impractical to separate these things, short of entirely shunning the latter, and tempting as it is I can't quite advocate a return to pure ASCII.
[+] hibbelig|4 years ago|reply
There are two German words that uppercase to the same: Masse (physical unit, measured in kg usually: mass) and Maße (plural of Maß: measurement). So downcasing MASSE either requires understanding the text or results in a superposition.
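
The collision is easy to show in Python. This sketch groups a small, made-up word list by upper-cased form, so that "downcasing" can only return a set of candidates:

```python
from collections import defaultdict

words = ["Masse", "Maße", "Straße"]  # illustrative word list

# Group the original words by their upper-cased form.
by_upper = defaultdict(list)
for word in words:
    by_upper[word.upper()].append(word)

candidates = by_upper["MASSE"]  # every original that upper-cases to MASSE

print(candidates)
```

Picking the right candidate needs context that the string itself does not carry.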
[+] tshaddox|4 years ago|reply
Or that some people’s names can only be represented as a bitmap or vector graphic. Or what if some people’s names can only be represented in a computer by that computer executing arbitrary code? Then all computer software that accepts the input of human names must by definition have arbitrary code execution vulnerabilities!
[+] gfxgirl|4 years ago|reply
Sometimes I wish the popular programming languages had been invented by people whose written language had no concept of UPPER and lower case. There's so much cruft in code bases because of it. Conventions include kUPPERCASE for constants, CapitalizedCamelCase for classes and/or functions, sometimes snake_case for variables, or whatever. So then you have millions of wasted person-hours on things that could have been automated if the names matched, but they don't.

Example

    enum Command {
      kCOPY,
      kCUT,
      kPASTE,
    };
    class CopyCmd : public Cmd { ... };
    class CutCmd : public Cmd { ... };
    class PasteCmd : public Cmd { ... };
    
    Cmd* MakeCommand(Command cmd) {
      switch (cmd) {
        case kCUT: return new CutCmd();
        case kCOPY: return new CopyCmd();
        case kPASTE: return new PasteCmd();
      ...
The fact that in some places it's COPY, in others Copy, and probably in others copy means work that would disappear if it was just always COPY. All of it is superfluous work that someone from a language without the concept of upper and lower case would never even have considered when coming up with coding standards. Hey, I could just copy and paste this.... oh, the cases need to be fixed. Oh, I could code generate this with a macro.... oh, well, I've got to put every case form of the name in the macro so I can use the correct one, etc...
[+] avgcorrection|4 years ago|reply
This is an unreasonable expectation to have for Unicode anyway.

1. Assume that some letter X has only a lower-case version. It’s represented with two bytes in UTF-8.

2. A capitalized version is added way later.

3. There are no more two-byte codepoints available.

4. So it has to use three bytes or more.

I see people are jumping on the “oh the wicked complexity” bandwagon in here but I don’t see what the big deal is.
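
Two concrete pairs that behave like the scenario above, sketched in Python. In both, the code point count stays at one while the UTF-8 byte count changes. `utf8_len` is a small helper defined here for the example.

```python
def utf8_len(text: str) -> int:
    """Number of bytes in the UTF-8 encoding of text."""
    return len(text.encode("utf-8"))


dotless_i = "\u0131"         # ı, two bytes in UTF-8
upper_i = dotless_i.upper()  # plain ASCII 'I', one byte

a_stroke = "\u023a"          # Ⱥ, two bytes in UTF-8
lower_a = a_stroke.lower()   # ⱥ (U+2C65), three bytes

print(utf8_len(dotless_i), utf8_len(upper_i))
print(utf8_len(a_stroke), utf8_len(lower_a))
```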

[+] ironmagma|4 years ago|reply
Presumably string length corresponds to something like “number of glyphs” rather than byte count.
[+] heavyset_go|4 years ago|reply
Python has str.casefold() for caseless comparisons that handles the example in the OP[1]:

> str.casefold()

> Return a casefolded copy of the string. Casefolded strings may be used for caseless matching.

> Casefolding is similar to lowercasing but more aggressive because it is intended to remove all case distinctions in a string. For example, the German lowercase letter 'ß' is equivalent to "ss". Since it is already lowercase, lower() would do nothing to 'ß'; casefold() converts it to "ss".

> The casefolding algorithm is described in section 3.13 of the Unicode Standard.

[1] https://docs.python.org/3/library/stdtypes.html#str.casefold
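
A quick sketch of the difference in practice. `caseless_equal` is a hypothetical helper; it does no Unicode normalization, so composed and decomposed forms of the same text would still compare as different.

```python
def caseless_equal(a: str, b: str) -> bool:
    """Compare two strings while ignoring case, via casefold() (sketch)."""
    return a.casefold() == b.casefold()


with_lower = "Straße".lower() == "STRASSE".lower()  # ß survives lower()
with_casefold = caseless_equal("Straße", "STRASSE")  # ß is folded to "ss"

print(with_lower, with_casefold)
```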

[+] eesmith|4 years ago|reply
Also, ligatures ("ﬃ" -> "FFI"). Here are the 102 single Unicode code points which, after upper(), map to more than one code point:

  >>> for i in range(1_114_112):
  ...   s = chr(i)
  ...   if len(s) != len(s.upper()): print(i, s, s.upper())
  ...
  223 ß SS
  329 ʼn ʼN
  496 ǰ J̌
  912 ΐ Ϊ́
  944 ΰ Ϋ́
  1415 և ԵՒ
  7830 ẖ H̱
    ...
  8188 ῼ ΩΙ
  64256 ﬀ FF
  64257 ﬁ FI
    ...
  64279 ﬗ ՄԽ
[+] bgro|4 years ago|reply
This sounds like a question they'd ask during a tech interview and when you ask which language or what they mean by "length," the "senior" developer says, "Heh, guess you're not one of us who work for a living. You couldn't even answer a basic programming question. You see, length means how long a thing is. Everyone knows that. NEXT."
[+] OisinMoran|4 years ago|reply
As heavyset_go mentioned casefolding is your friend here, but you'll also likely want to do some unidecoding too depending on your use case. With that lens, going from one to two characters is pretty tame as there are numerous examples of single characters that unidecode to eight! And also ones that go to zero.

We've done some pretty fun work at Inscribe ensuring things like this work across multiple languages, in particular for matching names. So we can now verify that a document does indeed belong to "Vladimir" even if the name on the document is "Владимир". The trickier part is not simply matching those, but returning the original raw result, as for that you need to keep track of how your normalization has changed the length of things.
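
One way to do that bookkeeping, sketched in Python under the assumption that the normalization is just `casefold()` applied character by character. Both function names are made up for this example and are not taken from any particular library or product.

```python
from typing import List, Optional, Tuple


def casefold_with_offsets(text: str) -> Tuple[str, List[int]]:
    """Casefold text and record, for every output character, the index of
    the input character it came from."""
    folded_chars: List[str] = []
    offsets: List[int] = []
    for index, char in enumerate(text):
        for folded in char.casefold():  # one input char may yield several
            folded_chars.append(folded)
            offsets.append(index)
    return "".join(folded_chars), offsets


def find_original_span(text: str, needle: str) -> Optional[str]:
    """Caseless search that returns the matching slice of the original text."""
    folded_needle = needle.casefold()
    if not folded_needle:
        return ""
    folded, offsets = casefold_with_offsets(text)
    start = folded.find(folded_needle)
    if start == -1:
        return None
    last = start + len(folded_needle) - 1  # last matched folded character
    return text[offsets[start]:offsets[last] + 1]


print(find_original_span("Hauptstraße 5", "STRASSE"))
```

A match that ends halfway through an expanded character (for example on the first "s" of a folded ß) returns the whole original character, which may or may not be what a given application wants.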

If you're interested in this kind of thing, shoot me a mail at [email protected]. We're growing rapidly and hiring to match.

[+] ir193|4 years ago|reply
Case distinction does not happen in every language. I'm not sure if it's European-languages-specific.
[+] euske|4 years ago|reply
I just found that this applies to Python too.

    >>> 'Straße'.upper()
    'STRASSE'
[+] hulahoof|4 years ago|reply
Now this has me wondering how you would be able to reverse something like this to get something like:

>>> 'Straße'.upper().lower()

'straße'

instead of:

>>> 'Straße'.upper().lower()

'strasse'

[+] heavyset_go|4 years ago|reply
Use str.casefold() if you're using str.upper() or str.lower() for comparisons.
[+] darepublic|4 years ago|reply
Probably would have assumed the same, though thinking about it harder and remembering i18n should have provided a clue.
[+] spullara|4 years ago|reply
What would have happened if Microsoft had decided they were only going to support ASCII?
[+] BlueTemplar|4 years ago|reply
They might have lost to IBM, who already had their own "Extended ASCII" ?
[+] sillysaurusx|4 years ago|reply
I'll paypal $20 to anyone who can name a situation where string length (in number of visible characters) is actually required by any reasonable algorithm.

(High roller, I know.)

No games, though. If you say "An algorithm to compute the length of the string in number of visible characters," that obviously is designed to pass the test rather than to do anything useful.

Maybe framing it as a bet will break the illusion that string length is ever required.

[+] ordiel|4 years ago|reply
Set-top boxes (the things you use to decode a paid TV signal and watch "the [shows] programming guide") use a format called EIT (Event Information Table), which is transferred as an encoded binary file and in which MANY of the titles and TV show info are capitalized (mostly* for readability, you know, so an L does not look like a 1). I worked on a project generating such files, and it is HIGHLY sensitive to the length of the data (since it will be encoded to binary, and if you add a single byte to the length of such a file the rest of the info turns into garbage)... Now that I think about it, I should probably give one of the guys still there a call to let them know the shows in German may have a bug... Can I get my 20 now?
[+] dotancohen|4 years ago|reply

  > I'll paypal $20 to anyone who can name a situation where string length (in number of visible characters) is actually required by any reasonable algorithm.
I've already described above where I've had this exact requirement. The client demands that "50 letters" be allowed in a title field for a website element. He means letters: spaces (and presumably newlines) don't count, nor does punctuation. Emojis were not a concern at the time. I wrote JavaScript and PHP validators for client- and server-side validation and stored the string in a VARCHAR(127) to be safe.

And this is completely reasonable. He is a human, not "a computer person" (his terminology) and pays me to abstract from him those "computer things" which he has no desire to understand.
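
For illustration only, here is roughly what such a validator could look like in Python (the original validators were JavaScript and PHP, and are not shown in this thread). It assumes "letter" means any code point whose Unicode general category starts with L, and it counts code points rather than grapheme clusters. Both function names are hypothetical.

```python
import unicodedata


def count_letters(text: str) -> int:
    """Count code points whose Unicode general category is a letter (L*)."""
    return sum(1 for ch in text if unicodedata.category(ch).startswith("L"))


def title_is_valid(title: str, max_letters: int = 50) -> bool:
    """Accept a title if it has at most max_letters letters (sketch)."""
    return count_letters(title) <= max_letters


print(count_letters("Hello, world!"), count_letters("Straße 42"))
```

Combining marks fall outside the L categories, so a decomposed "é" still counts as one letter under this definition.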

[+] yrral|4 years ago|reply
When you have fixed width font and need to size columns appropriately for the data displayed in eg: a table.
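
A rough Python sketch of that column-width calculation using only the standard library. `display_width` is a hypothetical helper and a simplification: it treats combining marks as zero columns and East Asian wide or fullwidth characters as two, and it ignores emoji sequences and terminal-specific quirks.

```python
import unicodedata


def display_width(text: str) -> int:
    """Approximate number of monospace columns needed for text (sketch)."""
    width = 0
    for ch in text:
        if unicodedata.combining(ch):
            continue                  # combining marks take no column
        if unicodedata.east_asian_width(ch) in ("W", "F"):
            width += 2                # wide and fullwidth characters
        else:
            width += 1
    return width


rows = ["abc", "日本語", "e\u0301"]
column_width = max(display_width(row) for row in rows)

print(column_width)
```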
[+] kaetemi|4 years ago|reply
At the very least you need to know which code point clusters are one visual character for stepping the cursor left and right through text in a text editing control.

Which is 99% of the work to also calculate the length in those terms.

[+] latch|4 years ago|reply
I'm not sure I understand what you're getting at. But if changing the case didn't alter the length, you could do it in-place, without having to allocate memory. I always thought this was the point?
[+] rtpg|4 years ago|reply
implementing tab-to-spacification in a way that requires tabs to work "as tabs" (i.e. you get your alignment along 4-space lines) with a mono-spaced font.

I suppose some terminal rendering (where you're taking in chunks and you might want to be a bit clever about jumping to spaces to figure out where breaks should go).

There is a notion of Unicode code points that you can index into (see how Python does string indexing), though I generally think that people who whine a lot about string behavior for Unicode tend to just not be reaching for the right tools.

[+] JoshTriplett|4 years ago|reply
Showing a column number in a text editor, or in a compiler error message that lists line:column so that an editor can jump to the error.
[+] detaro|4 years ago|reply
To not have users complain you're withholding characters when you implement artificial input lengths? ;) (although e.g. Twitter does not accurately count visible characters, despite the character limit being a prominent feature, suggesting it's not that important)

"What size does this UI element have to fit this text in our specific mono-spaced font"?

[+] 3np|4 years ago|reply
Input validation.

“Full name” and “username” must both have a minimum length of 1.

[+] Retr0id|4 years ago|reply
You're programming in C, and you need to know the length of the buffer you're about to memcpy.