> Except in Safari, whose maxlength implementation seems to treat all emoji as length 1. This means that the maxlength attribute is not fully interoperable between browsers.
No, it's definitely not. You can read the byte length more directly in JS, and use that to decide whether more text is allowed.
// Measure the input's UTF-8 byte length directly
const encoder = new TextEncoder();
const currentBytes = encoder.encode(inputStr).byteLength;
But the maxlength attribute is at best an approximation. Don't rely on it for things like limiting length for database fields (not that you should trust the client anyway).
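To illustrate the point, here is a small sketch of how the same single "character" yields several different lengths depending on what you count (the emoji is written with escape sequences, and the variable names are just for illustration):

```javascript
// Thumbs-up (U+1F44D) plus a skin-tone modifier (U+1F3FD): one visible symbol.
const s = "\u{1F44D}\u{1F3FD}";

const codeUnits = s.length;                           // UTF-16 code units
const codePoints = [...s].length;                     // Unicode code points
const bytes = new TextEncoder().encode(s).byteLength; // UTF-8 bytes

console.log(codeUnits, codePoints, bytes); // 4 2 8
```

So a database column sized in bytes, a maxlength counted in code units, and a user counting symbols can all disagree about the same input.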
Apple's take seems more reasonable. When a user uses an emoji, they think of it as a single symbol; they don't care about the Unicode implementation or its length in bytes. IMO this should be the standard, and all other interpretations are a repeat of the transition from ASCII to Unicode.
It seems odd to suggest the bug is with Safari. Normal humans (and even most developers!) don’t care that the byte length of an emoji in a particular encoding, which may or may not be under their control, defines the maximum “characters” in a text box (“character” here loosely meaning a logical collection of code points, each of which may fit into one or more bytes).
It's a bug with Safari because the HTML spec defines maxlength as applying to the number of UTF-16 code units [1]:
> Constraint validation: If an element has a maximum allowed value length, its dirty value flag is true, its value was last changed by a user edit (as opposed to a change made by a script), and the length of the element's API value is greater than the element's maximum allowed value length, then the element is suffering from being too long.
Where "the length of" is a link to [2]:
> A string’s length is the number of code units it contains.
And "code units" is a link to [3]:
> A string is a sequence of unsigned 16-bit integers, also known as code units.
I agree with your implied point that this is a questionable definition, though!
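Those 16-bit code units are exactly what JavaScript's string length exposes, which is why a single emoji can already exceed maxlength=1 under the spec. A quick sketch:

```javascript
// A single code point outside the Basic Multilingual Plane (😀, U+1F600)
// is stored as two 16-bit code units: a surrogate pair.
const grin = "\u{1F600}";

console.log(grin.length);                      // 2 — two code units
console.log(grin.charCodeAt(0).toString(16));  // "d83d" (high surrogate)
console.log(grin.charCodeAt(1).toString(16));  // "de00" (low surrogate)
console.log(grin.codePointAt(0).toString(16)); // "1f600" (the code point)
```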
Yeah, the author's conclusion is flawed. If I enter an emoji and a different one appears, I'm just going to assume your website is broken. Safari is in the right here.
Any attempt at defining what people think of as characters is going to fail because of how many exceptions our combined writing systems have. See: codepoints, characters, grapheme clusters.
I never knew that [emoji of family of 4] takes 5 backspaces to eliminate.
It goes from [emoji of family of 4] to [emoji of family of 3] to [father and mother] to [father] to [father]. Somehow [father] can take double the (key)punches of the rest of his family.
Edit: Oops, I was trying to type emoji into an HN comment; apparently that is not supported.
Edit 2: Actually it took 7 backspaces to obliterate the whole family. Tough one.
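The backspace count makes more sense once you see how the family emoji is built: it is four separate emoji glued together with ZERO WIDTH JOINER (U+200D). A quick sketch, with escapes since raw emoji may be stripped here:

```javascript
// Man + ZWJ + woman + ZWJ + girl + ZWJ + boy renders as one family glyph.
const family = "\u{1F468}\u200D\u{1F469}\u200D\u{1F467}\u200D\u{1F466}";

const parts = [...family];
console.log(parts.length);  // 7 code points — man, ZWJ, woman, ZWJ, girl, ZWJ, boy
console.log(family.length); // 11 UTF-16 code units (4 surrogate pairs + 3 ZWJs)
```

How many backspaces that takes depends entirely on whether the editor deletes by code unit, code point, or grapheme cluster.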
Edge cases like this will only get more common as Unicode keeps getting more complex. There was a fun slide in this talk[1] suggesting that Unicode might be Turing-complete due to its case folding rules.
I miss when Unicode was just a simple list of codepoints. (Get off my lawn)
These "edge cases" have always existed in Unicode. Languages with ZWJ needs have existed in Unicode since the beginning. That emoji put a spotlight on this for especially English-speaking developers with assumptions that language encodings are "simple", is probably one of the best things about the popularity of emoji.
I think we need some kind of standard "Unicode-light", with limitations that allow it to be used on low-spec hardware and without weird edge cases like this. A bit like video codecs that have "profiles": sets of limitations you can adhere to to avoid overwhelming low-end hardware.
It wouldn't be "universal", but enough to write in the most commonly used languages, and maybe support a few, single codepoint special characters and emoji.
I’ve worked on large data entry forms for decades. I stopped using maxlength a long time ago because of this. Entry should be free-form; truncation is unexpected behavior. Validation should catch the limitation and never manipulate what the user entered. People paste without realizing the text got cut off, and information gets lost. They type without looking at the screen. I’ve even seen maxlength used on a password-setting form but not on the sign-in form, so anyone who set a password longer than the limit thought it succeeded, but login failed and even tech support was baffled for hours.
We already know client side limits aren’t enough and server validation is required anyway. Trying to cleverly “help” entry usually just causes headaches for users and devs. Dynamically showing/hiding, enabling/disabling, focus jumping, auto formatting, updating other fields based on other values, is usually confusing for the user and more difficult to program correctly. Just show it all, allow all entry and catch everything in the validation code, everyone will be happier.
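A minimal sketch of the validate-don't-truncate approach described above (the function name and limit are hypothetical, and the limit counts UTF-16 code units to match what maxlength would count):

```javascript
// Hypothetical limit mirroring a server-side column size.
const MAX_UNITS = 100;

// Reject over-long input with an error instead of silently truncating it.
function validateComment(value) {
  if (value.length > MAX_UNITS) {
    return { ok: false, error: `Too long: ${value.length}/${MAX_UNITS}.` };
  }
  return { ok: true };
}

console.log(validateComment("short enough").ok); // true
console.log(validateComment("x".repeat(200)).ok); // false
```

The same check runs again on the server, since client-side validation is only a convenience.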
Odd, it makes sense on a technical level when it comes to these ZWJ characters, but hiding the implementation makes sense from Safari’s point of view. I’d actually prefer that as a UNIVERSAL standard, visible symbols vs characters. (When it comes to UI)
But I can also imagine this is problematic when setting validation rules elsewhere, and now there's a subtle footgun buried in most web forms.
I guess the thing to learn here is to not rely on maxlength=
It seems like the obvious root cause of these sorts of things is languages (and developers) that can't or won't differentiate between "byte length" and "string length". Joel warned us all about this 20 years (!!) ago[1], and we're still struggling with it.
RobotToaster:
Do database engines all agree on edge cases like this?
[1] https://html.spec.whatwg.org/multipage/form-control-infrastr...
[2] https://infra.spec.whatwg.org/#string-length
[3] https://infra.spec.whatwg.org/#code-unit
Manfred:
A good starting place is UAX #29: https://www.unicode.org/reports/tr29/tr29-41.html
However, the gold standard in UI implementations is that you never break the user's input.
rockwotj:
It starts by breaking down common Unicode assumptions folks have.
Dylan16807:
People don't think about code points and they definitely don't think about code units.
What does "character" mean?
Grapheme clusters aren't perfect but they're far ahead of code whatevers.
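Grapheme clusters are also directly countable in modern JavaScript via Intl.Segmenter (available in recent browsers and Node 16+); a quick sketch using the family emoji from earlier in the thread:

```javascript
// Segment by grapheme cluster: what a user perceives as one character.
const seg = new Intl.Segmenter("en", { granularity: "grapheme" });

const family = "\u{1F468}\u200D\u{1F469}\u200D\u{1F467}\u200D\u{1F466}";
const graphemes = [...seg.segment(family)];

console.log(graphemes.length); // 1 — one perceived character
console.log(family.length);    // 11 UTF-16 code units
```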
Wowfunhappy:
https://codepen.io/Wowfunhappy/pen/dyqpMXO?editors=1000
waltbosz:
https://codepen.io/waltbosz/pen/zYJKBEE
[1]: https://seriot.ch/resources/talks_papers/20171027_brainfuck_...
[1] https://www.joelonsoftware.com/2003/10/08/the-absolute-minim...
raffy:
https://adraffy.github.io/ens-normalize.js/test/resolver.htm...
Also:
https://adraffy.github.io/ens-normalize.js/test/emoji.html
https://adraffy.github.io/ens-normalize.js/test/chars.html
https://adraffy.github.io/ens-normalize.js/test/confused.htm...
https://adraffy.github.io/punycode.js/test/demo.html
Here is a video of another failure case:
https://twitter.com/adraffy/status/1629262969581473792