Raku seems to be more correct (DWIM) in this regard than all the examples given in the post...
my \emoji = "\c[FACE PALM]\c[EMOJI MODIFIER FITZPATRICK TYPE-3]\c[ZERO WIDTH JOINER]\c[MALE SIGN]\c[VARIATION SELECTOR-16]";
# One character
say emoji.chars; # 1
# Five code points
say emoji.codes; # 5
# If I want to know how many bytes that takes up in various encodings...
say emoji.encode('UTF8').bytes; # 17 bytes
say emoji.encode('UTF16').bytes; # 14 bytes
Edit: Updated to use the names of each code point since HN cannot display the emoji
While I agree with this assessment, it means that these are the basic string operations:
* Passing strings around
* Reading and writing strings from files / sockets
* Concatenation
Anything else should reckon with extended grapheme clusters, whether it does so or not. Even proper upcasing is impossible without knowing, for one example, whether or not the string is in Turkish.
What happens if you try to express dż or dź from Polish orthography?
You can use
dż - \u0064\u017c - d followed by 'LATIN SMALL LETTER Z WITH DOT ABOVE'
dż - \u0064\u007a\u0307 - d followed by z, followed by combining diacritical dot above
dż - \u01f3\u0307 - dz with combining diacritical dot above
multiplied by the uppercase and titlecase forms of each.
In Polish orthography the dz digraph is considered 2 letters, despite being only one sound (głoska). I'm not so sure about Macedonian orthography; they might count it as one thing.
Medieval ß is a letter/ligature that was created from ſʒ, that is, a long s and a tailed z. In other words it is a form of the 'sz' digraph.
Today it is used only in German orthography.
How long is ß?
By some rules, uppercasing ß yields SS or SZ. Should uppercasing or titlecasing operations change the length of a string?
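These cases are easy to reproduce. A quick Python sketch (assuming Python 3, where str.upper applies the full Unicode case mapping and unicodedata handles normalization):

```python
import unicodedata

# German sharp s: uppercasing changes the string's length.
assert "ß".upper() == "SS"
assert len("ß") == 1 and len("ß".upper()) == 2

# Two of the spellings of ż listed above:
precomposed = "\u017c"   # LATIN SMALL LETTER Z WITH DOT ABOVE
decomposed = "z\u0307"   # z + COMBINING DOT ABOVE

# They normalize into each other, yet their codepoint counts differ.
assert unicodedata.normalize("NFD", precomposed) == decomposed
assert unicodedata.normalize("NFC", decomposed) == precomposed
assert len(precomposed) == 1 and len(decomposed) == 2
```

So a length defined in codepoints is not even stable under case mapping or normalization, which is the point of the question above.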
> The first is useful for basic string operations.
The only thing it's useful for is sizing up storage. It does nothing for "basic string operations" unless "basic string operations" are solely 7-bit ascii manipulations.
Why do people prefer UTF-8 coordinates? While for storage I think we should use UTF-8, when working with strings live it’s just so much easier to use UTF-16 because it’s predictable: 1 unit for the basic plane and 2 for everything else (the multi-character emoji and modifier stuff aside). I am probably biased because I mostly use and think about DOMStrings which are always UTF-16 but I’m not sure why people who use languages which are more flexible about string representations than JavaScript would not also appreciate this kind of regularity.
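For what it's worth, the UTF-16 code-unit count is easy to recover even in a language that doesn't store strings as UTF-16. A Python sketch (the function name is mine):

```python
def utf16_units(s: str) -> int:
    # Number of UTF-16 code units: 1 per BMP codepoint, 2 per astral codepoint.
    return len(s.encode("utf-16-le")) // 2

assert utf16_units("A") == 1           # BMP character: one unit
assert utf16_units("\U0001F926") == 2  # astral character: surrogate pair

# This matches JavaScript's String.prototype.length / DOMString semantics,
# e.g. 7 for the article's facepalm sequence.
assert utf16_units("\U0001F926\U0001F3FC\u200D\u2642\uFE0F") == 7
```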
I agree. The author talks a little bit about which UTF encoding makes sense in which situation, but they never make an argument about which result from len is correct.
My two cents is that string length should always be the number of Unicode codepoints in the string, regardless of encoding. If you want the byte length, I'm sure there is a sizeof equivalent for your language.
When we call len() on an array, we want the number of objects in the array. When we iterate over an array, we want to deal with one object from the array at a time. We don't care how big an object is. A large object shouldn't count for more than 1 when calculating the array length.
Similarly, a Unicode codepoint is the fundamental object in a string. The byte size of a codepoint does not affect the length of the string. It makes no sense to iterate over each byte in a Unicode string, because a byte on its own is completely meaningless. len() should be the number of objects we can iterate over, just like in arrays.
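Python 3 already works exactly on this model; a quick sketch:

```python
# The facepalm sequence from the article, spelled out as codepoints.
emoji = "\U0001F926\U0001F3FC\u200D\u2642\uFE0F"

# A str is a sequence of codepoints: len() and iteration agree.
assert len(emoji) == 5
assert [hex(ord(c)) for c in emoji] == [
    "0x1f926", "0x1f3fc", "0x200d", "0x2642", "0xfe0f"
]

# Byte size is a property of a chosen encoding, asked for separately.
assert len(emoji.encode("utf-8")) == 17
```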
Perhaps the issue is that there's a canonical "length" at all. It would make more sense to me to have different types of length depending on which measure you're after, like Swift apparently has, but without the canonical `.count`. When there are multiple interpretations of a thing's length, asking for "length" leaves the developers to resolve the ambiguity, and I'm of the firm belief that developers shouldn't consider themselves psychic.
The main reason, I think, that Swift strings have `count` is that they conform to the `Collection` protocol. Swift's stdlib has a pervasive "generic programming" philosophy, enabled by various protocol hierarchies.
So, given that the property is required to be present, some semantic or the other had to be chosen. I am sure there were debates when they were writing `String` about which one was proper.
If you're working with an image, you might have an Image class that has an Image.width and an Image.height in pixels, regardless of how these pixels are laid out in memory (which depends on encoding, colorspace, etc.). Most if not all methods operate on these pixels, e.g. cropping, scaling, color filtering, etc. Then, there might be an Image.memory property that provides access to the underlying, actual bytes.
I don't understand why the same is not the obvious solution for strings. len("🇨🇦") should be 1, because we as humans read one emoji as one character, regardless of the memory layout behind it. Most if not all methods operate on characters (reversing, concatenating, indexing).
And then, if you need access to the low level data, the String.memory would contain the actual bytes... which would be different depending on the actual text encoding.
The number of bytes necessary is incredibly important for security reasons. It's arguably better to make the number of bytes be the primary value and have a secondary lookup be the number of glyphs.
To be fair, some systems distinguish between size and length (with size expected to be O(1) and length allowed to be up to O(n)). For those systems proceed as parent.
Python 3's approach is clearly the best, because it focuses on the problem at hand: Unicode codepoints. A string in Python is a sequence of Unicode codepoints; its length should be the number of codepoints in the sequence. It has nothing to do with bytes.
To draw an absurd parallel, "[emoji]".len() == 17 is equivalent to [1,2].len() == 8 (two 32-bit integers).
In my opinion the most useful result in the case the article describes is 5. There should of course be a way to get 1 (the number of extended grapheme clusters), but it should not be the string's "length".
Don't Swift and Go support iterating over graphemes? Edit: yes, Swift is mentioned at the bottom of the article.
It'd be great to have a function for that in other scripting languages like Python, Ruby, etc.
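In Python the third-party `regex` module can already do this (its `\X` pattern matches an extended grapheme cluster). With only the stdlib you can approximate it. Here is a deliberately rough sketch, nowhere near a full UAX #29 implementation (flags, Hangul, and many other cases are wrong), but enough for the emoji in the article:

```python
import unicodedata

ZWJ = "\u200D"    # ZERO WIDTH JOINER
VS16 = "\uFE0F"   # VARIATION SELECTOR-16

def rough_grapheme_count(s: str) -> int:
    # Rough sketch only: a codepoint extends the current cluster if it is
    # a combining mark, VS16, an emoji skin-tone modifier, or joined by ZWJ.
    count = 0
    join_next = False  # previous codepoint was a ZWJ
    for ch in s:
        if join_next:
            join_next = (ch == ZWJ)
            continue
        if ch == ZWJ:
            join_next = True
            continue
        if ch == VS16 or unicodedata.combining(ch):
            continue
        if 0x1F3FB <= ord(ch) <= 0x1F3FF:  # emoji skin-tone modifiers
            continue
        count += 1
    return count

# The facepalm sequence from the article: one perceived character.
assert rough_grapheme_count("\U0001F926\U0001F3FC\u200D\u2642\uFE0F") == 1
assert rough_grapheme_count("z\u0307") == 1   # z + combining dot above
```

For anything real you'd want a proper UAX #29 segmenter (ICU, Swift's stdlib, or the `regex`/`grapheme` packages), but the sketch shows why this can't live at the byte or codepoint level.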
There was an interesting sub-thread here on HN a while ago about how a string-of-codepoints is just as bad as a string-of-bytes (subject to truncation, incorrect iteration, incorrect length calculation) and that we should just have string-of-bytes if we can't have string-of-graphemes. I don't agree, but some people felt very strongly about it.
If you do a formula in Google Sheets and it contains an emoji, this comes into play. For example, if the first character is an emoji and you want to reference it you need to do =LEFT(A1, 2) - and not =LEFT(A1, 1)
We need to deal with mountains of user input from mobile devices and the worst we have run into is "smart" quotes from iOS devices hosing older business systems. Our end users are pretty good about not typing pirate flags into customer name fields.
I still haven't run into a situation where I need to count the logical number of glyphs in a string. Any system that we need to push a string into will be limiting things on the basis of byte counts, so a .Length check is still exactly what we need.
I have used glyph counting a handful of times, mostly for width computing before I learned there were better ways. I'm 100% sure my logic was just waiting to fail on any input that didn't use the Latin alphabet.
I like how Go handles this case by providing the utf8.DecodeRune/utf8.DecodeRuneInString functions, which return each individual "rune" as well as its size in bytes.
Coming from python2, for me it was the first time I saw a language handle unicode so gracefully.
17 in php with strlen, which is defined as counting bytes.
1 when you use grapheme_strlen.
mb_strlen and iconv_strlen return 5, and 5 is rather useless as the article says.
I was wondering what the title meant. Turns out HN’s emoji stripper screwed with the title.
It’s asking why a skin toned (not yellow) facepalm emoji’s length is 7 when the user perceives it as a single character.
Tangent: Emojis are an interesting topic in regards to programming. They challenged the "rule" held by programmers that every character is a single code unit (of 8 or 16 bits), i.e. that str[1234] gets me the 1235th character, when it actually gets the 1235th byte. UTF-8 threw a wrench in that, but many programmers went along ignoring reality.
Sadly, preexisting languages such as Arabic weren’t warning enough in regards to line breaking. As in: an Arabic “character”[a] can change its width depending on if there’s a “character” before or after it (which gives it its cursive-like look). So, a naive line breaking routine could cause bugs if it tried to break in the middle of a word. Tom Scott has a nice video on it that was filmed when the “effective power” crash was going around.[0]
[a]: Arabic script isn’t technically an alphabet like Latin characters are. It’s either an abugida or abjad (depending on who you ask). See Wikipedia: https://en.wikipedia.org/wiki/Arabic_script
I’m curious: is the no-emoji rule a rule that happens to block emoji or a hardcoded rule? What I mean is: emojis (in UTF-16) have to use surrogate pairs because all their code points are in the U+1xxxx plane. Is the software just disallowing any characters needing two code points to encode (which would include emoji)? Or is it specifically singling out the emoji blocks?
Counterpoint: Appending U+1F609, U+1F61C, or U+1F643 to the end of your comment would have added immense value in communicating your point because it would have lightened the otherwise harsh tone of your words :)
[+] [-] patrickas|5 years ago|reply
[+] [-] cjm42|5 years ago|reply
No such method 'length' for invocant of type 'Str'. Did you mean any of these: 'codes', 'chars'?
Because as the article points out, the "length" of a string is an ambiguous concept these days.
[+] [-] duskwuff|5 years ago|reply
[+] [-] ChrisSD|5 years ago|reply
* Number of native (preferably UTF-8) code units
* Number of extended grapheme clusters
The first is useful for basic string operations. The second is good for telling you what a user would consider a "character".
[+] [-] wodenokoto|5 years ago|reply
Reversing a string is what I would consider a basic string operation, but I also expect it not to break emoji and other grapheme clusters.
Nothing is easy.
[+] [-] samatman|5 years ago|reply
[+] [-] dahfizz|5 years ago|reply
> The first is useful for basic string operations
Can you expand on this? I don't see why knowing the number of code units would be useful except when calculating the total size of the string to allocate memory. Basic string operations, such as converting to uppercase, would operate on codepoints, regardless of how many code units are used to encode that codepoint.
Converting 'Á' to 'á', for example, is an operation on one codepoint but multiple code units.
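A small Python illustration of the codepoint vs code unit distinction, and of where the code-unit count bites (byte-level slicing):

```python
s = "\u00c1"  # 'Á': one codepoint, two UTF-8 code units
assert len(s) == 1
assert len(s.encode("utf-8")) == 2

# Case mapping works per codepoint, independent of the encoding:
assert s.lower() == "\u00e1"  # 'á'

# But truncating by code units can cut a codepoint in half:
truncated = s.encode("utf-8")[:1]
assert truncated.decode("utf-8", errors="replace") == "\ufffd"
```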
[+] [-] emergie|5 years ago|reply
[+] [-] masklinn|5 years ago|reply
[+] [-] bikeshaving|5 years ago|reply
[+] [-] unknown|5 years ago|reply
[deleted]
[+] [-] SamBam|5 years ago|reply
> 'But It’s Better that "[emoji]".len() == 17 and Rather Useless that len("[emoji]") == 5'
It sounds like it's just whether you're counting UTF-8/-16/-32 code units. Does the article explain why one is worse and the other "rather useless"?
[+] [-] dahfizz|5 years ago|reply
[+] [-] beaconstudios|5 years ago|reply
[+] [-] anaerobicover|5 years ago|reply
[+] [-] j1elo|5 years ago|reply
[+] [-] dnautics|5 years ago|reply
[+] [-] throwawayffffas|5 years ago|reply
[+] [-] nerdponx|5 years ago|reply
[+] [-] banthar|5 years ago|reply
Nobody ever cares about unicode codepoints. You either want the number of bytes or the width of the string on screen.
UTF-32 codepoints waste space and give you neither.
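Screen width is its own can of worms. A crude Python sketch of the idea, using only the stdlib's East Asian Width data (the function name is mine; it assumes a monospaced terminal and ignores combining marks and emoji sequences, which wcwidth()/uc_width() handle better):

```python
import unicodedata

def rough_display_width(s: str) -> int:
    # Wide ("W") and Fullwidth ("F") codepoints take two terminal cells;
    # everything else is crudely counted as one cell.
    return sum(2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
               for ch in s)

assert rough_display_width("abc") == 3
assert rough_display_width("\u53e4") == 2  # CJK ideograph 古 is double-width
```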
[+] [-] anaerobicover|5 years ago|reply
[+] [-] eximius|5 years ago|reply
But I believe this can be handled explicitly and well, and I'm trying to do that in my fuzzy string matching library based on fuzzywuzzy.
https://github.com/logannc/fuzzywuzzy-rs/pull/26/files
[+] [-] sanedigital|5 years ago|reply
[+] [-] donaldihunter|5 years ago|reply
[+] [-] hateful|5 years ago|reply
[+] [-] bob1029|5 years ago|reply
Does this cause trouble for anyone else?
[+] [-] ravi-delia|5 years ago|reply
[+] [-] Lightbody|5 years ago|reply
https://tonsky.me/blog/emoji/
Edit: confirmed, it was on HN just half a day prior. Probably why this article arrived too. Just surprised nobody reference this other one in this thread :)
[+] [-] waterside81|5 years ago|reply
[+] [-] maxnoe|5 years ago|reply
[+] [-] rurban|5 years ago|reply
In C land it is called uc_width (libunistring). "Length" is too ambiguous. Bytes? Unicode codepoints? UTF-8 code units?
[+] [-] tzs|5 years ago|reply
[+] [-] Rizz|5 years ago|reply
[+] [-] colejohnson66|5 years ago|reply
[0]: https://youtu.be/hJLMSllzoLA
[+] [-] MengerSponge|5 years ago|reply
Of course, we're taught to parse that as multiple discrete letters from an early age, so we don't get confused :)
[+] [-] dooglius|5 years ago|reply
[+] [-] colejohnson66|5 years ago|reply
[+] [-] kyberias|5 years ago|reply
[+] [-] Techyrack|5 years ago|reply
[deleted]
[+] [-] mannykannot|5 years ago|reply
[+] [-] elliekelly|5 years ago|reply
[+] [-] Kaze404|5 years ago|reply