Drag an emoji family with a string size of 11 into an input with maxlength=10

[+] SquareWheel|3 years ago|reply

> Except in Safari, whose maxlength implementation seems to treat all emoji as length 1. This means that the maxlength attribute is not fully interoperable between browsers.

No, it's definitely not. You can read the byte length more directly in JS, and use that to inform if more text is allowed or not.

    const encoder = new TextEncoder();
    const currentBytes = encoder.encode(inputStr).byteLength;

But the maxlength attribute is at best an approximation. Don't rely on it for things like limiting length for database fields (not that you should trust the client anyway).

[+] ASalazarMX|3 years ago|reply

Apple's take seems more reasonable. When a user uses an emoji, they think of it as a single symbol, they don't care of the Unicode implementation, or its length in bytes. IMO this should be the standard, and all other interpretations are a repeat of the transition from ASCII to Unicode.

[+] RobotToaster|3 years ago|reply

>Don't rely on it for things like limiting length for database fields (not that you should trust the client anyway).

Do database engines all agree on edge cases like this?

[+] jonhohle|3 years ago|reply

It seems odd to suggest the bug is with Safari. Normal humans (and even most developers!) don’t care that the byte length of an emoji in a particular encoding that may or may not be under their control defines the maximum “characters” in a text box (character used to define a logical collection of code points each of which may fit into one or multiple bytes).

[+] twic|3 years ago|reply

It's a bug with Safari because the HTML spec defines maxlength as applying to the number of UTF-16 code units [1]:

> Constraint validation: If an element has a maximum allowed value length, its dirty value flag is true, its value was last changed by a user edit (as opposed to a change made by a script), and the length of the element's API value is greater than the element's maximum allowed value length, then the element is suffering from being too long.

Where "the length of" is a link to [2]:

> A string’s length is the number of code units it contains.

And "code units" is a link to [3]:

> A string is a sequence of unsigned 16-bit integers, also known as code units.

I agree with your implied point that this is a questionable definition, though!

[1] https://html.spec.whatwg.org/multipage/form-control-infrastr...

[2] https://infra.spec.whatwg.org/#string-length

[3] https://infra.spec.whatwg.org/#code-unit

[+] biftek|3 years ago|reply

Yea the authors conclusion is flawed. If I enter an emoji and a different one appears I'm just going to assume your website is broken. Safari is in the right here.

[+] Manfred|3 years ago|reply

Any attempt at defining what people think of as characters is going to fail because of how many exceptions our combined writing systems have. See: codepoints, characters, grapheme clusters.

A good starting place is UAX #29: https://www.unicode.org/reports/tr29/tr29-41.html

However, the gold standard in UI implementations is that you never break the user's input.

[+] rockwotj|3 years ago|reply

Personally I think this is much easier to digest than the above link: https://manishearth.github.io/blog/2017/01/14/stop-ascribing...

It starts by breaking down common Unicode assumptions folks have

[+] Dylan16807|3 years ago|reply

> See: codepoints, characters, grapheme clusters.

People don't think about code points and they definitely don't think about code units.

What does "character" mean?

Grapheme clusters aren't perfect but they're far ahead of code whatevers.

[+] a_c|3 years ago|reply

Edit 2: Actually it took 7 backspace to obliviate the whole family. Tough one.

Edit: Oops. I was trying to type emoji into HN comment. apparently it is not supported.

I never knew that [emoji of family of 4] takes 5 backspace to eliminate. It goes from [emoji of family of 4] to [emoji of family of 3] to [father and mother] to [father] to [father]. Somehow [father] can take double the (key)punch than the rest of his family.

[+] WirelessGigabit|3 years ago|reply

There is a ZWJ after father. That would explain the double punch.

[+] lapcat|3 years ago|reply

The bigger problem is that web browsers silently truncate strings over the maxlength. This behavior is particularly nasty for secure password fields.

[+] Wowfunhappy|3 years ago|reply

If anyone would like to try this out:

https://codepen.io/Wowfunhappy/pen/dyqpMXO?editors=1000

[+] waltbosz|3 years ago|reply

I forked your code and made it a bit more fun. Presenting: the Magical Shrinking Emoji Family.

https://codepen.io/waltbosz/pen/zYJKBEE

[+] rom-antics|3 years ago|reply

Edge cases like this will just get more common as unicode keeps getting more complex. There was a fun slide in this talk[1] that suggests unicode might be turing complete due to its case folding rules.

I miss when Unicode was just a simple list of codepoints. (Get off my lawn)

[1]: https://seriot.ch/resources/talks_papers/20171027_brainfuck_...

[+] WorldMaker|3 years ago|reply

These "edge cases" have always existed in Unicode. Languages with ZWJ needs have existed in Unicode since the beginning. That emoji put a spotlight on this for especially English-speaking developers with assumptions that language encodings are "simple", is probably one of the best things about the popularity of emoji.

[+] Dylan16807|3 years ago|reply

Unicode always had combining characters. Is this so different from accent marks disappearing? And the Hangul pieces were there from 1.0/1.1.

[+] GuB-42|3 years ago|reply

I think we need some kind of standard "Unicode-light" with limitations to allow it to be used on low specs hardware and without weird edge cases like this. A bit like video codecs that have "profiles" which are limitations you can adhere to to avoid overwhelming low end hardware.

It wouldn't be "universal", but enough to write in the most commonly used languages, and maybe support a few, single codepoint special characters and emoji.

[+] kgwxd|3 years ago|reply

I’ve worked on large data entry forms for decades. I stopped using maxlength a long time ago because of this. Entry should be free form, truncation is unexpected behavior. Validation should catch the limitation and never manipulate what the user entered. People paste not realizing text got cutoff and information gets lost. They type without looking at the screen. I’ve even seen it used on a password setting form but not on the sign in form so anyone that thinks they set a password larger than the limit thinks it succeeded but login fails and even tech support gets baffled for hours.

We already know client side limits aren’t enough and server validation is required anyway. Trying to cleverly “help” entry usually just causes headaches for users and devs. Dynamically showing/hiding, enabling/disabling, focus jumping, auto formatting, updating other fields based on other values, is usually confusing for the user and more difficult to program correctly. Just show it all, allow all entry and catch everything in the validation code, everyone will be happier.

[+] graypegg|3 years ago|reply

Odd, it makes sense on a technical level when it comes to these ZWJ characters, but hiding the implementation makes sense from Safari’s point of view. I’d actually prefer that as a UNIVERSAL standard, visible symbols vs characters. (When it comes to UI)

But I can also imagine this is problematic when setting validation rules else where and now there’s a subtle foot gun buried in most web forms.

I guess the thing to learn here is to not rely on maxlength=

[+] zuhsetaqi|3 years ago|reply

The thing every developer should have learned from the beginning is to never rely on anything that comes from user input or client

[+] ryandrake|3 years ago|reply

Seems like an obvious root cause for these sorts of things are languages (and developers) who can't or won't differentiate between "byte length" and "string length". Joel warned us all about this 20 years (!!) ago[1], and we're still struggling with it.

1: https://www.joelonsoftware.com/2003/10/08/the-absolute-minim...

[+] rockwotj|3 years ago|reply

String length isn't even a well defined term. Do you mean codepoint length? Or the number of graphemes?

[+] raffy|3 years ago|reply

I've created a bunch of Unicode tools for Ethereum Name Service (ENS).

https://adraffy.github.io/ens-normalize.js/test/resolver.htm...

Also:

https://adraffy.github.io/ens-normalize.js/test/emoji.html

https://adraffy.github.io/ens-normalize.js/test/chars.html

https://adraffy.github.io/ens-normalize.js/test/confused.htm...

https://adraffy.github.io/punycode.js/test/demo.html

Here is a video of another failure case: