The issue is that the concept of “length” without additional context no longer applies to Unicode strings. There isn't one length; there are four or five, at least.
1. Length in the context of storage or transmission is the number of bytes of its UTF-8 representation - and even that varies between the composed and decomposed normalization forms.
2. Length in terms of number of visible characters is the number of grapheme clusters.
3. Length in terms of visible space on screen is the number of pixels wide and tall when rendered in a given font.
4. Length when parsing is based on the number of code points.
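A minimal Python sketch of how those counts diverge for a single visible character:

```python
import unicodedata

s = "e\u0301"  # 'e' plus U+0301 COMBINING ACUTE ACCENT; renders as "é"

print(len(s))                  # 2 code points
print(len(s.encode("utf-8")))  # 3 bytes in UTF-8

nfc = unicodedata.normalize("NFC", s)  # composed form: the single code point U+00E9
print(len(nfc))                  # 1 code point
print(len(nfc.encode("utf-8")))  # 2 bytes in UTF-8
```

Counting sense #2 (grapheme clusters) needs a UAX #29 implementation, such as the third-party regex module's \X pattern; both forms above are one cluster.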
If you look at it that way it makes perfect sense that the “length” of a string could change when any operation is performed on it - because there’s no such thing as a canonical length.
tl;dr bug or not, the concept of a canonical string length is an anachronism from the pre-Unicode days when all those were the same thing. There’s no such thing as a string length anymore.
Except almost everyone always means #2. No one asked for strings to be ruined in this way, and this kind of pedantry has caused untold frustration for developers who just want their strings to work properly.
If you must expose the underlying byte array, do so via a sane access function that returns a typed array.
As for “string length in pixels”, that has absolutely nothing to do with the string itself as that’s determined in the UI layer that ingests the string.
This particular capitalization would have generated a longer string without Unicode - it's a language convention that a capitalization routine could apply to just about any encoding that has an ß.
That is an issue, for sure. But that is not "the issue" that they're describing. They are clearly talking about (2), so confusion about other meanings is irrelevant.
Their point is that they would expect the number of visible grapheme clusters to be the same when converting between cases. But in some languages that's not true, as their example demonstrates. (An upper-case ß does exist in Unicode, but culturally it's less correct than SS.)
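Python's str.upper() follows this Unicode mapping, so the length change is easy to see:

```python
s = "Straße"
u = s.upper()          # Unicode SpecialCasing maps ß to SS
print(u)               # STRASSE
print(len(s), len(u))  # 6 7

# The capital ẞ (U+1E9E) exists, but the default mapping doesn't produce it:
print("ß".upper())       # SS
print("\u1e9e".lower())  # ß
```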
5. Length in terms of time until complete display when rendered in a given architecture.
6. Length in terms of Braille characters needed to display it.
7. Length in terms of reading time at 300 wpm.
String length is overwhelmingly discussed as character length, meaning #2. Length in bytes should only be an issue for data transmission or storage, and people who still work with ASCII. Rendering dimensions are not relevant to the published article.
> 2. Length in terms of number of visible characters is the number of grapheme clusters.
There's a fun subtlety in this case too: a single grapheme cluster need not draw as a single "visible character" (or glyph) on-screen. The visual representation of a grapheme cluster is entirely dependent on the text system which draws this, and the meaning that system applies to the cluster itself. This is especially true for multi-element emoji clusters, whose recommended meanings[1] change with evolving versions of Unicode.
To add to this, Unicode 12 simplified the definition of grapheme clusters by actually generalizing them so that they can be matched effectively by a regex. (See the "extended grapheme cluster" definition in TR29[2].) This reduced the overall number of special cases and hard-coded lists of combinations in the definitions of grapheme clusters (particularly around emoji), but it also means that there are now infinitely more valid grapheme clusters that don't necessarily map to a single glyph.
(Edit: it appears that HN is actually stripping out the ZWJ text from this example and leaving just the Copyright symbol. See below for how to reproduce this text on your machine.)
(I picked this combination somewhat randomly, but ideally, this is an example that should hopefully last as it feels unlikely that "horse copyright" would have a meaningful glyph definition in the future. As of posting this, the above text shows up as two side-by-side glyphs on my machine (macOS Monterey 21A559): a horse, followed by the copyright sign. This may look similar on your machine, or it may not.)
Importantly, you can tell this is actually treated as a real grapheme cluster by the text system on macOS because if you copy that string into a Cocoa text view (e.g., TextEdit), you will only be able to place your cursor on either side of the cluster, but not split it in the middle. A nice interactive way to see this in action is inserting U+1F434 into the document, followed by U+00A9. Then, move your cursor in between those two glyphs and insert U+200D: your cursor should then bounce out from the middle of the newly-formed cluster to the beginning.
This was a pretty short example, but this is arbitrarily extensible: (Edit: Originally I had posted U+2705 <check mark symbol> + U+200D + U+1F434 <horse head> + U+200D + U+1F50B <battery> + U+200D + U+1F9F7 <safety pin> [sorry, no staple emoji] but HN stripped that out too. It does appear correctly in the text area while typing, but HN replaces the sequence with spaces after posting.)
As linked above, Unicode does offer a list of sequences like this that are considered to be "meaningful"[1], which you can largely expect vendors which offer emoji representations to respect (and some vendors may offer glyphs for sequences beyond what is suggested here). If you've ever run into this: additions to this list over time explains why transporting a Unicode sequence which appears as a single glyph on one OS can appear as multiple glyphs on an older one (each individual glyph may be supported, but their combination may or may not have a meaning).
In general, if this is interesting to you, you may enjoy trawling through the Unicode emoji data files [3]. You may discover something new!
[1] https://www.unicode.org/Public/emoji/14.0/emoji-zwj-sequence... [2] https://www.unicode.org/reports/tr29/tr29-35.html#Table_Comb... [3] https://www.unicode.org/reports/tr51/#emoji_data
Another fun fact: Upper-casing is language dependent. In English uppercasing 'i' gets you 'I'. But Turkish has a dotted and un-dotted 'i', each with an uppercase variant. So if your user's language was Turkish, uppercasing 'i' would give you 'İ', and lowercasing 'I' would give you 'ı'.
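Python's built-in case mappings are locale-independent (proper Turkish casing needs an ICU binding such as PyICU), but even the default mapping for İ shows how casing can change length:

```python
# The four distinct Turkish letters:
for ch in ("i", "I", "\u0131", "\u0130"):  # i, I, ı (dotless), İ (dotted capital)
    print(f"U+{ord(ch):04X}", ch)

# The locale-independent default lowercasing of İ changes the length:
s = "\u0130"     # İ
low = s.lower()  # 'i' followed by U+0307 COMBINING DOT ABOVE
print(len(s), len(low))  # 1 2
```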
Makes me wonder how case insensitive file systems handle this...and for more fun, handle the situation where the user changes the system language. I know that the Turkish 'I' delayed at least one big company's Turkish localization efforts for a while.
> Makes me wonder how case insensitive file systems handle this...
They generally don't. It is true that several case-insensitive file systems (including NTFS, exFAT and ext4 [1]) maintain some sort of configurable case-folding map, but it is mostly there to guard against future Unicode updates and does not vary across locales.
[1] https://dfir.ru/2021/07/15/playing-with-case-insensitive-fil... (ext4 supports an optional case insensitivity)
Another example is that in Dutch, the bigram 'ij' is considered a single letter, and so at the beginning of a word, both have to be uppercased. See for example the Dutch Wikipedia page for Iceland: https://nl.wikipedia.org/wiki/IJsland.
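Python's str.title() doesn't know this Dutch rule; only the legacy compatibility digraph code points Ĳ/ĳ carry it, as a quick sketch shows:

```python
print("ijsland".title())  # Ijsland; Dutch wants IJsland

# The compatibility digraphs Ĳ (U+0132) / ĳ (U+0133) title-case as a unit:
print("\u0133sland".title())  # Ĳsland
```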
A somewhat surprising and interesting side effect of this can be found in the blog post "Hacking GitHub with Unicode's dotless i" [1], which is now fairly well known.
[1] https://eng.getwisdom.io/hacking-github-with-unicode-dotless...
It goes beyond just language... uppercasing can also be locale-dependent. In Microsoft Word, for example, uppercasing an "é" gets you "É" in the French (Canada) locale but "E" in the French (France) locale.
My understanding is that case insensitive filesystems that wish to be portable have a mapping table in the metadata. A quick search showed references to an 'upcase table', although I'm not sure of the accuracy of the source, so I won't link it.
Just because the user changed the system language doesn't mean the system should be expected to change the upcase table though. That operation would need to be very carefully managed; you can't change the rules if there are existing files that would conflict in the new mapping. And you might have symbolic links that matched because of case insensitivity that won't anymore... Pretty tricky.
I sort of expect that nothing can be assumed when talking about strings and characters anymore. Waiting for the post on HN one day that says that it's in the Unicode spec that characters can animate or be in superposition until observed…
I sometimes believe that full general-purpose embracing of Unicode for text, with no clean distinction between "machine-friendly" text and "for human" natural-language text (in every script since the dawn of time plus every goofy emoji anyone dreams up, with all the complexity these entail), is a major mistake that has led computing astray. I fear, though, that it is impractical to separate these things, short of entirely shunning the latter, and tempting as it is I can't quite advocate a return to pure ASCII.
There are two German words that uppercase to the same: Masse (physical unit, measured in kg usually: mass) and Maße (plural of Maß: measurement). So downcasing MASSE either requires understanding the text or results in a superposition.
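A quick check in Python:

```python
print("Masse".upper())  # MASSE (mass)
print("Maße".upper())   # MASSE (measurements): the distinction is gone
print("MASSE".lower())  # masse, which round-trips to only one of the two words
```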
Or that some people’s names can only be represented as a bitmap or vector graphic. Or what if some people’s names can only be represented in a computer by that computer executing arbitrary code? Then all computer software that accepts the input of human names must by definition have arbitrary code execution vulnerabilities!
Sometimes I wish the popular programming languages had been invented by people whose written language had no concept of UPPER and lower case. There's so much cruft in code bases because of it. Conventions include kUPPERCASE for constants, CapitalizedCamelCase for classes and/or functions, sometimes snake_case for variables, or whatever. So then you have millions of wasted person-hours on things that could have been automated if the names matched, but they don't.
Example
  enum Commands {
    kCOPY,
    kCUT,
    kPASTE,
  };

  class CopyCmd : public Cmd { ... };
  class CutCmd : public Cmd { ... };
  class PasteCmd : public Cmd { ... };

  Cmd* MakeCommand(Commands cmd) {
    switch (cmd) {
      case kCUT: return new CutCmd();
      case kCOPY: return new CopyCmd();
      case kPASTE: return new PasteCmd();
      ...
The fact that in some places it's COPY, in others Copy, and in probably others copy means work that would disappear if it was just always COPY. All of it superfluous work that someone from a language that doesn't have the concept of upper and lower case would never have even considered when coming up with coding standards. Hey, I could just copy and paste this.... oh, the cases need to be fixed. Oh, I could code-generate this with a macro.... oh, well, I've got to put every case form of the name in the macro so I can use the correct one, etc...
This is an unreasonable expectation to have for Unicode anyway.
1. Assume that some letter X has only a lower-case version. It’s represented with two bytes in UTF-8.
2. A capitalized version is added way later
3. There are no more two-byte codepoints available
4. So it has to use three bytes or more
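That is roughly what happened with ß: the capital ẞ was only added in Unicode 5.1, at a code point outside the two-byte UTF-8 range:

```python
print(len("\u00df".encode("utf-8")))  # 2 bytes: ß, U+00DF
print(len("\u1e9e".encode("utf-8")))  # 3 bytes: ẞ, U+1E9E LATIN CAPITAL LETTER SHARP S
```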
I see people are jumping on the “oh the wicked complexity” bandwagon in here but I don’t see what the big deal is.
Python has str.casefold() for caseless comparisons that handles the example in the OP[1]:
> str.casefold()
> Return a casefolded copy of the string. Casefolded strings may be used for caseless matching.
> Casefolding is similar to lowercasing but more aggressive because it is intended to remove all case distinctions in a string. For example, the German lowercase letter 'ß' is equivalent to "ss". Since it is already lowercase, lower() would do nothing to 'ß'; casefold() converts it to "ss".
> The casefolding algorithm is described in section 3.13 of the Unicode Standard.
[1] https://docs.python.org/3/library/stdtypes.html#str.casefold
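So a caseless comparison succeeds where lower() fails:

```python
s1, s2 = "Straße", "STRASSE"
print(s1.lower() == s2.lower())        # False: lower() leaves ß alone
print(s1.casefold() == s2.casefold())  # True: casefold() maps ß to ss
```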
This sounds like a question they'd ask during a tech interview and when you ask which language or what they mean by "length," the "senior" developer says, "Heh, guess you're not one of us who work for a living. You couldn't even answer a basic programming question. You see, length means how long a thing is. Everyone knows that. NEXT."
As heavyset_go mentioned casefolding is your friend here, but you'll also likely want to do some unidecoding too depending on your use case. With that lens, going from one to two characters is pretty tame as there are numerous examples of single characters that unidecode to eight! And also ones that go to zero.
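For the curious, a rough stdlib-only sketch of this kind of folding (real unidecode does far more, e.g. transliterating Cyrillic; this one just drops whatever NFKD can't reduce to ASCII):

```python
import unicodedata

def ascii_fold(s):
    # Decompose (NFKD), then drop anything that doesn't survive as ASCII.
    return unicodedata.normalize("NFKD", s).encode("ascii", "ignore").decode("ascii")

print(ascii_fold("café"))      # cafe
print(ascii_fold("Владимир"))  # '' (Cyrillic has no ASCII decomposition here)
```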
We've done some pretty fun work at Inscribe ensuring things like this work across multiple languages, in particular for matching names. So we can now verify that a document does indeed belong to "Vladimir" even if the name on the document is "Владимир". The trickier part is not simply matching those, but returning the original raw result, as for that you need to keep track of how your normalization has changed the length of things.
If you're interested in this kind of thing, shoot me a mail at [email protected]. We're growing rapidly and hiring to match.
I was testing this lately: Teradata, SQL Server, Oracle and others just return ß for upper('ß'). Snowflake returns SS.
There is, btw, an upper-case letter defined for ß: https://m.wikidata.org/wiki/Q9693
I'll paypal $20 to anyone who can name a situation where string length (in number of visible characters) is actually required by any reasonable algorithm.
(High roller, I know.)
No games, though. If you say "An algorithm to compute the length of the string in number of visible characters," that obviously is designed to pass the test rather than to do anything useful.
Maybe framing it as a bet will break the illusion that string length is ever required.
Set top boxes (the things you use to decode paid TV signal and watch "the [shows] programming guide") use a format called EIT (Event Information Table), which is transferred as an encoded binary file, and in which MANY of the titles and TV show info are capitalized (mostly for readability, you know, so an L does not look like a 1). I worked on a project for generating such files and it is HIGHLY sensitive to the length of the data (since it gets encoded to binary, and if you add a single byte to the length of such a file the rest of the info turns into garbage)... Now that I think about it, I should probably give one of the guys still there a call to let them know the shows in German may have a bug... Can I get my 20 now?
> I'll paypal $20 to anyone who can name a situation where string length (in number of visible characters) is actually required by any reasonable algorithm.
I've already described above where I've had this exact requirement. The client demands that "50 letters" be allowed in a title field for a website element. He means letters: spaces (and presumably newlines) don't count, neither do punctuation. Emojis were not a concern at the time. I wrote Javascript and PHP validators for client- and server-side validation and stored the string in a VARCHAR(127) to be safe.
And this is completely reasonable. He is a human, not "a computer person" (his terminology) and pays me to abstract from him those "computer things" which he has no desire to understand.
At the very least you need to know which code point clusters are one visual character, for stepping the cursor left and right through text in a text-editing control.
Which is 99% of the work to also calculate the length in those terms.
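A naive sketch of that logic (it only merges combining marks; real cursor movement needs full UAX #29 grapheme segmentation, e.g. the third-party regex module's \X):

```python
import unicodedata

def clusters(s):
    # Attach combining marks to the preceding base character.
    out = []
    for ch in s:
        if out and unicodedata.combining(ch):
            out[-1] += ch
        else:
            out.append(ch)
    return out

print(clusters("e\u0301a"))       # ['é', 'a']: two cursor positions, not three
print(len(clusters("e\u0301a")))  # 2, the "length in visible characters"
```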
I'm not sure I understand what you're getting at. But if changing the case didn't alter the length, you could do it in-place, without having to allocate memory. I always thought this was the point?
implementing tab-to-spacification in a way that requires tabs to work "as tabs" (i.e. you get your alignment along 4-space lines) with a mono-spaced font.
I suppose some terminal rendering (where you're taking in chunks and you might want to be a bit clever about jumping to spaces to figure out where breaks should go).
There is a notion of UTF codepoints that you can index into (see how Python does string indexing), though I generally think that people who whine a lot about string behavior for UTF tend to just not be reaching for the right tools.
To not have users complain you're withholding characters when you implement artificial input lengths? ;) (Although e.g. Twitter does not accurately count visible characters, despite the character limit being a prominent feature, suggesting it's not that important.)
"What size does this UI element have to fit this text in our specific mono-spaced font"?
> 5. Length in terms of time until complete display when rendered in a given architecture.

Grasping at straws, eh?
You mean like language-dependent unified Han glyphs? [1]
[1] https://en.wikipedia.org/wiki/Han_unification#Examples_of_la...
>>> 'Straße'.upper().lower()
'straße'
instead of:
>>> 'Straße'.upper().lower()
'strasse'
> I'll paypal $20 to anyone who can name a situation where string length (in number of visible characters) is actually required by any reasonable algorithm.

“Full name” and “username” must both have a minimum length of 1.