Raku seems to be more correct (DWIM) in this regard than all the examples given in the post...
my \emoji = "\c[FACE PALM]\c[EMOJI MODIFIER FITZPATRICK TYPE-3]\c[ZERO WIDTH JOINER]\c[MALE SIGN]\c[VARIATION SELECTOR-16]";
# One character
say emoji.chars; # 1
# Five code points
say emoji.codes; # 5
# If I want to know how many bytes that takes up in various encodings...
say emoji.encode('UTF8').bytes; # 17 bytes
say emoji.encode('UTF16').bytes; # 14 bytes
Edit: Updated to use the names of each code point since HN cannot display the emoji
While I agree with this assessment, it means that these are the basic string operations:
* Passing strings around
* Reading and writing strings from files / sockets
* Concatenation
Anything else should reckon with extended grapheme clusters, whether it does so or not. Even proper upcasing is impossible without knowing, for one example, whether or not the string is in Turkish.
What happens if you try to express dż or dź from Polish orthography?
You can use
dż - \u0064\u017c - d followed by 'LATIN SMALL LETTER Z WITH DOT ABOVE'
dż - \u0064\u007a\u0307 - d followed by z, followed by combining diacritical dot above
dż - \u01f3\u0307 - dz with combining diacritical dot above
multiplied by the uppercase and titlecase forms of each.
In Polish orthography the dz digraph is considered 2 letters, despite being only one sound (głoska). I'm not so sure about Macedonian orthography; they might count it as one thing.
Medieval ß is a letter/ligature that was created from ſʒ, that is, a long s and a tailed z. In other words it is a form of the 'sz' digraph.
Today it is used only in German orthography.
How long is ß?
By some rules, uppercasing ß yields SS or SZ. Should uppercasing or titlecasing operations change the length of a string?
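These cases are easy to reproduce. A quick Python sketch (assuming Python 3, where str.upper applies the full Unicode case mapping and unicodedata handles normalization):

```python
import unicodedata

# German sharp s: uppercasing changes the string's length.
assert "ß".upper() == "SS"
assert len("ß") == 1 and len("ß".upper()) == 2

# Two of the spellings of ż listed above:
precomposed = "\u017c"   # LATIN SMALL LETTER Z WITH DOT ABOVE
decomposed = "z\u0307"   # z + COMBINING DOT ABOVE

# They normalize into each other, yet their codepoint counts differ.
assert unicodedata.normalize("NFD", precomposed) == decomposed
assert unicodedata.normalize("NFC", decomposed) == precomposed
assert len(precomposed) == 1 and len(decomposed) == 2
```

So a length defined in codepoints is not even stable under case mapping or normalization, which is the point of the question above.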
> The first is useful for basic string operations.
The only thing it's useful for is sizing up storage. It does nothing for "basic string operations" unless "basic string operations" are solely 7-bit ascii manipulations.
Why do people prefer UTF-8 coordinates? While for storage I think we should use UTF-8, when working with strings live it’s just so much easier to use UTF-16 because it’s predictable: 1 unit for the basic plane and 2 for everything else (the multi-character emoji and modifier stuff aside). I am probably biased because I mostly use and think about DOMStrings which are always UTF-16 but I’m not sure why people who use languages which are more flexible about string representations than JavaScript would not also appreciate this kind of regularity.
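For what it's worth, the UTF-16 code-unit count is easy to recover even in a language that doesn't store strings as UTF-16. A Python sketch (the function name is mine):

```python
def utf16_units(s: str) -> int:
    # Number of UTF-16 code units: 1 per BMP codepoint, 2 per astral codepoint.
    return len(s.encode("utf-16-le")) // 2

assert utf16_units("A") == 1           # BMP character: one unit
assert utf16_units("\U0001F926") == 2  # astral character: surrogate pair

# This matches JavaScript's String.prototype.length / DOMString semantics,
# e.g. 7 for the article's facepalm sequence.
assert utf16_units("\U0001F926\U0001F3FC\u200D\u2642\uFE0F") == 7
```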
I agree. The author talks a little bit about which UTF encoding makes sense in which situation, but they never make an argument about which result from len is correct.
My two cents is that string length should always be the number of Unicode codepoints in the string, regardless of encoding. If you want the byte length, I'm sure there is a sizeof equivalent for your language.
When we call len() on an array, we want the number of objects in the array. When we iterate over an array, we want to deal with one object from the array at a time. We don't care how big an object is. A large object shouldn't count for more than 1 when calculating the array length.
Similarly, a Unicode codepoint is the fundamental object in a string. The byte size of a codepoint does not affect the length of the string. It makes no sense to iterate over each byte in a Unicode string, because a byte on its own is completely meaningless. len() should be the number of objects we can iterate over, just like in arrays.
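Python 3 already works exactly on this model; a quick sketch:

```python
# The facepalm sequence from the article, spelled out as codepoints.
emoji = "\U0001F926\U0001F3FC\u200D\u2642\uFE0F"

# A str is a sequence of codepoints: len() and iteration agree.
assert len(emoji) == 5
assert [hex(ord(c)) for c in emoji] == [
    "0x1f926", "0x1f3fc", "0x200d", "0x2642", "0xfe0f"
]

# Byte size is a property of a chosen encoding, asked for separately.
assert len(emoji.encode("utf-8")) == 17
```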
Perhaps the issue is that there's a canonical "length" at all. It would make more sense to me to have different types of length depending on which measure you're after, like Swift apparently has, but without the canonical `.count`. When there are multiple interpretations of a thing's length, asking for "length" leaves the developers to resolve the ambiguity, and I'm of the firm belief that developers shouldn't consider themselves psychic.
The main reason, I think, that Swift strings have `count` is that they conform to the `Collection` protocol. Swift's stdlib has a pervasive "generic programming" philosophy, enabled by various protocol hierarchies.
So, given that the property is required to be present, some semantic or the other had to be chosen. I am sure there were debates when they were writing `String` about which one was proper.
If you're working with an image, you might have an Image class that has an Image.width and an Image.height in pixels, regardless of how these pixels are laid out in memory (which depends on encoding, colorspace, etc.). Most if not all methods operate on these pixels, e.g. cropping, scaling, color filtering, etc. Then, there might be an Image.memory property that provides access to the underlying, actual bytes.
I don't understand why the same is not the obvious solution for strings. len("🇨🇦") should be 1, because we as humans read one emoji as one character, regardless of the memory layout behind it. Most if not all methods operate on characters (reversing, concatenating, indexing).
And then, if you need access to the low level data, the String.memory would contain the actual bytes... which would be different depending on the actual text encoding.
The number of bytes necessary is incredibly important for security reasons. It's arguably better to make the number of bytes be the primary value and have a secondary lookup be the number of glyphs.
To be fair, some systems distinguish between size and length (with size expected to be O(1) and length allowed to be up to O(n)). For those systems proceed as parent.
Python 3's approach is clearly the best, because it focuses on the problem at hand: Unicode codepoints. A string in Python is a sequence of Unicode codepoints; its length should be the number of codepoints in the sequence. It has nothing to do with bytes.
To draw an absurd parallel, "[emoji]".len() == 17 is equivalent to [1,2].len() == 8 (two 32-bit integers).
In my opinion the most useful result in the case the article describes is 5. There should of course be a way to get 1 (the number of extended grapheme clusters), but it should not be the string's "length".
Don't Swift and Go support iterating over graphemes? Edit: yes, Swift is mentioned at the bottom of the article.
It'd be great to have a function for that in other scripting languages like Python, Ruby, etc.
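In Python the third-party `regex` module can already do this (its `\X` pattern matches an extended grapheme cluster). With only the stdlib you can approximate it. Here is a deliberately rough sketch, nowhere near a full UAX #29 implementation (flags, Hangul, and many other cases are wrong), but enough for the emoji in the article:

```python
import unicodedata

ZWJ = "\u200D"    # ZERO WIDTH JOINER
VS16 = "\uFE0F"   # VARIATION SELECTOR-16

def rough_grapheme_count(s: str) -> int:
    # Rough sketch only: a codepoint extends the current cluster if it is
    # a combining mark, VS16, an emoji skin-tone modifier, or joined by ZWJ.
    count = 0
    join_next = False  # previous codepoint was a ZWJ
    for ch in s:
        if join_next:
            join_next = (ch == ZWJ)
            continue
        if ch == ZWJ:
            join_next = True
            continue
        if ch == VS16 or unicodedata.combining(ch):
            continue
        if 0x1F3FB <= ord(ch) <= 0x1F3FF:  # emoji skin-tone modifiers
            continue
        count += 1
    return count

# The facepalm sequence from the article: one perceived character.
assert rough_grapheme_count("\U0001F926\U0001F3FC\u200D\u2642\uFE0F") == 1
assert rough_grapheme_count("z\u0307") == 1   # z + combining dot above
```

For anything real you'd want a proper UAX #29 segmenter (ICU, Swift's stdlib, or the `regex`/`grapheme` packages), but the sketch shows why this can't live at the byte or codepoint level.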
There was an interesting sub-thread here on HN a while ago about how a string-of-codepoints is just as bad as a string-of-bytes (subject to truncation, incorrect iteration, incorrect length calculation) and that we should just have string-of-bytes if we can't have string-of-graphemes. I don't agree, but some people felt very strongly about it.
If you do a formula in Google Sheets and it contains an emoji, this comes into play. For example, if the first character is an emoji and you want to reference it you need to do =LEFT(A1, 2) - and not =LEFT(A1, 1)
We need to deal with mountains of user input from mobile devices and the worst we have run into is "smart" quotes from iOS devices hosing older business systems. Our end users are pretty good about not typing pirate flags into customer name fields.
I still haven't run into a situation where I need to count the logical number of glyphs in a string. Any system that we need to push a string into will be limiting things on the basis of byte counts, so a .Length check is still exactly what we need.
I have used glyph counting a handful of times, mostly for width computing before I learned there were better ways. I'm 100% sure my logic was just waiting to fail on any input that didn't use the Latin alphabet.
I like how Go handles this case by providing the utf8.DecodeRune/utf8.DecodeRuneInString functions, which return each individual "rune" as well as its size in bytes.
Coming from python2, for me it was the first time I saw a language handle unicode so gracefully.
17 in php with strlen, which is defined as counting bytes.
1 when you use grapheme_strlen.
mb_strlen and iconv_strlen return 5, and 5 is rather useless as the article says.
I was wondering what the title meant. Turns out HN’s emoji stripper screwed with the title.
It’s asking why a skin toned (not yellow) facepalm emoji’s length is 7 when the user perceives it as a single character.
Tangent: Emojis are an interesting topic in regards to programming. They challenged the "rule" held by programmers that every character is a single code unit (of 8 or 16 bits), i.e. that str[1234] gets me the 1235th character, when it actually gets the 1235th byte. UTF-8 threw a wrench in that, but many programmers went along ignoring reality.
Sadly, preexisting languages such as Arabic weren’t warning enough in regards to line breaking. As in: an Arabic “character”[a] can change its width depending on if there’s a “character” before or after it (which gives it its cursive-like look). So, a naive line breaking routine could cause bugs if it tried to break in the middle of a word. Tom Scott has a nice video on it that was filmed when the “effective power” crash was going around.[0]
[a]: Arabic script isn’t technically an alphabet like Latin characters are. It’s either an abugida or abjad (depending on who you ask). See Wikipedia: https://en.wikipedia.org/wiki/Arabic_script
I’m curious: is the no-emoji rule a rule that happens to block emoji or a hardcoded rule? What I mean is: emojis (in UTF-16) have to use surrogate pairs because all their code points are in the U+1xxxx plane. Is the software just disallowing any characters needing two code points to encode (which would include emoji)? Or is it specifically singling out the emoji blocks?
Counterpoint: Appending U+1F609, U+1F61C, or U+1F643 to the end of your comment would have added immense value in communicating your point because it would have lightened the otherwise harsh tone of your words :)
[+] [-] patrickas|5 years ago|reply
[+] [-] cjm42|5 years ago|reply
No such method 'length' for invocant of type 'Str'. Did you mean any of these: 'codes', 'chars'?
Because as the article points out, the "length" of a string is an ambiguous concept these days.
[+] [-] duskwuff|5 years ago|reply
[+] [-] ChrisSD|5 years ago|reply
* Number of native (preferably UTF-8) code units
* Number of extended grapheme clusters
The first is useful for basic string operations. The second is good for telling you what a user would consider a "character".
[+] [-] wodenokoto|5 years ago|reply
Reversing a string is what I would consider a basic string operation, but I also expect it not to break emoji and other grapheme clusters.
Nothing is easy.
[+] [-] samatman|5 years ago|reply
[+] [-] dahfizz|5 years ago|reply
> The first is useful for basic string operations
Can you expand on this? I don't see why knowing the number of code units would be useful except when calculating the total size of the string to allocate memory. Basic string operations, such as converting to uppercase, would operate on codepoints, regardless of how many code units are used to encode that codepoint.
Converting 'Á' to 'á', for example, is an operation on one codepoint but multiple code units.
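A small Python illustration of the codepoint vs code unit distinction, and of where the code-unit count bites (byte-level slicing):

```python
s = "\u00c1"  # 'Á': one codepoint, two UTF-8 code units
assert len(s) == 1
assert len(s.encode("utf-8")) == 2

# Case mapping works per codepoint, independent of the encoding:
assert s.lower() == "\u00e1"  # 'á'

# But truncating by code units can cut a codepoint in half:
truncated = s.encode("utf-8")[:1]
assert truncated.decode("utf-8", errors="replace") == "\ufffd"
```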
[+] [-] emergie|5 years ago|reply
[+] [-] masklinn|5 years ago|reply
[+] [-] bikeshaving|5 years ago|reply
[+] [-] unknown|5 years ago|reply
[deleted]
[+] [-] SamBam|5 years ago|reply
> 'But It’s Better that "[emoji]".len() == 17 and Rather Useless that len("[emoji]") == 5'
It sounds like it's just whether you're counting UTF-8/-16/-32 code units. Does the article explain why one is worse and the other "rather useless"?
[+] [-] dahfizz|5 years ago|reply
[+] [-] beaconstudios|5 years ago|reply
[+] [-] anaerobicover|5 years ago|reply
[+] [-] j1elo|5 years ago|reply
[+] [-] dnautics|5 years ago|reply
[+] [-] throwawayffffas|5 years ago|reply
[+] [-] nerdponx|5 years ago|reply
[+] [-] banthar|5 years ago|reply
Nobody ever cares about unicode codepoints. You either want the number of bytes or the width of the string on screen.
UTF-32 codepoints waste space and give you neither.
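Screen width is its own can of worms. A crude Python sketch of the idea, using only the stdlib's East Asian Width data (the function name is mine; it assumes a monospaced terminal and ignores combining marks and emoji sequences, which wcwidth()/uc_width() handle better):

```python
import unicodedata

def rough_display_width(s: str) -> int:
    # Wide ("W") and Fullwidth ("F") codepoints take two terminal cells;
    # everything else is crudely counted as one cell.
    return sum(2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
               for ch in s)

assert rough_display_width("abc") == 3
assert rough_display_width("\u53e4") == 2  # CJK ideograph 古 is double-width
```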
[+] [-] anaerobicover|5 years ago|reply
[+] [-] eximius|5 years ago|reply
But I believe this can be handled explicitly and well, and I'm trying to do that in my fuzzy string matching library based on fuzzywuzzy.
https://github.com/logannc/fuzzywuzzy-rs/pull/26/files
[+] [-] sanedigital|5 years ago|reply
[+] [-] donaldihunter|5 years ago|reply
[+] [-] hateful|5 years ago|reply
[+] [-] bob1029|5 years ago|reply
Does this cause trouble for anyone else?
[+] [-] ravi-delia|5 years ago|reply
[+] [-] Lightbody|5 years ago|reply
https://tonsky.me/blog/emoji/
Edit: confirmed, it was on HN just half a day prior. Probably why this article arrived too. Just surprised nobody reference this other one in this thread :)
[+] [-] waterside81|5 years ago|reply
[+] [-] maxnoe|5 years ago|reply
[+] [-] rurban|5 years ago|reply
In C land it is called uc_width (libunistring). "Length" is too ambiguous. Bytes? Unicode codepoints? UTF-8 code units?
[+] [-] tzs|5 years ago|reply
[+] [-] Rizz|5 years ago|reply
[+] [-] colejohnson66|5 years ago|reply
[0]: https://youtu.be/hJLMSllzoLA
[+] [-] MengerSponge|5 years ago|reply
Of course, we're taught to parse that as multiple discrete letters from an early age, so we don't get confused :)
[+] [-] dooglius|5 years ago|reply
[+] [-] colejohnson66|5 years ago|reply
[+] [-] kyberias|5 years ago|reply
[+] [-] Techyrack|5 years ago|reply
[deleted]
[+] [-] mannykannot|5 years ago|reply
[+] [-] elliekelly|5 years ago|reply
[+] [-] Kaze404|5 years ago|reply