I love how the title of this submission is changing every time I come back to HN.
At first there was an empty space between the double quotes. This made me click and read the article because it was surprising that the length of a space would be 7.
Then the actual emoji appeared and the title finally made sense.
Now I see escaped \u{…} characters spelled out and it’s just ridiculous.
Can’t wait to come back tomorrow to see what it will be then.
I think that string length is one of those things that people (including me) don't realise they never actually want. In a production system, I have never actually wanted string length. I have wanted:
- Number of bytes this will be stored as in the DB
- Number of monospaced font character blocks this string will take up on the screen
- Number of bytes that are actually being stored in memory
"String length" is just a proxy for something else, and whenever I'm thinking shallowly enough to want it (small scripts, mostly-ASCII, mostly-English, mostly-obvious failure modes, etc) I like grapheme cluster being the sensible default thing that people probably expect, on average.
Taking this one step further -- there's no such thing as the context-free length of a string.
Strings should be thought of more like opaque blobs, and you should derive their length exclusively in the context in which you intend to use it. It's an API anti-pattern to have a context-free length property associated with a string because it implies something about the receiver that just isn't true for all relevant usages and leads you to make incorrect assumptions about the result.
Refining your list, the things you usually want are:
- Number of bytes in a given encoding when saving or transmitting (edit: or more generally, when serializing).
- Number of code points when parsing.
- Number of grapheme clusters for advancing the cursor back and forth when editing.
- Bounding box in pixels or points for display with a given font.
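In Python (stdlib only), the first two of these fall straight out of the encoding machinery; a quick sketch using the facepalm sequence from the article:

```python
s = "\U0001F926\U0001F3FC\u200D\u2642\uFE0F"  # man facepalming: light skin tone

print(len(s.encode("utf-8")))           # 17 bytes when serialized as UTF-8
print(len(s.encode("utf-16-le")) // 2)  # 7 UTF-16 code units (2 surrogate pairs + 3 BMP chars)
print(len(s))                           # 5 code points, which is what Python's len() counts
```

The grapheme-cluster count (1 here) is the one measure the stdlib can't give you directly; it needs a Unicode segmentation library.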
Context-free length is something we inherited from ASCII where almost all of these happened to be the same, but that's not the case anymore. Unicode is better thought of as compiled bytecode than something you can or should intuit anything about.
It's like asking "what's the size of this JPEG." Answer is it depends, what are you trying to do?
ASCII is very convenient when it fits in the solution space (it’d better be, it was designed for a reason), but in the global international connected computing world it doesn’t fit at all. The problem is all the tutorials, especially low level ones, assume ASCII so 1) you can print something to the console and 2) to avoid mentioning that strings are hard so folks don’t get discouraged.
Notably Rust did the correct thing by defining multiple slightly incompatible string types for different purposes in the standard library and regularly gets flak for it.
> Number of monospaced font character blocks this string will take up on the screen
Even this has to deal with the halfwidth/fullwidth split in CJK. Even worse, Devanagari has complex rendering rules that actually depend on font choices. AFAIU, the only globally meaningful category here is rendered bounding box, which is obviously font-dependent.
But I agree with the general sentiment. What we really care about is how much space these text blobs take up, whether that be in a DB, in memory, or on the screen.
It's definitely worth thinking about the real problem, but I wouldn't say it's never helpful.
The underlying issue is unit conversion. "length" is a poor name because it's ambiguous. Replacing "length" with three functions - "lengthInBytes", "lengthInCharacters", and "lengthCombined" - would make it a lot easier to pick the right thing.
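A minimal sketch of two of those proposed helpers (names taken from the comment and hypothetical; "lengthCombined", if it means user-perceived characters, would additionally need a Unicode segmentation library, so only the first two are shown):

```python
def length_in_bytes(s: str, encoding: str = "utf-8") -> int:
    # Size when serialized in a given encoding
    return len(s.encode(encoding))

def length_in_characters(s: str) -> int:
    # "Characters" here meaning Unicode code points, which is what len() counts
    return len(s)

print(length_in_bytes("héllo"))       # 6: 'é' takes two bytes in UTF-8
print(length_in_characters("héllo"))  # 5
```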
> Number of monospaced font character blocks this string will take up on the screen
To predict the pixel width of a given text, right?
One thing I ran into is that despite certain fonts being monospace, characters from different Unicode blocks would have unexpected lengths. Like I'd have expected half-width CJK letters to render to the same pixel dimensions as Latin letters do, but they don't. It's ever so slightly off. Same with full-width CJK letters vs two Latin letters.
I'm not sure if this is due to some font fallback. I'd have expected e.g. VS Code to be able to render Japanese and English monospace in an aligned way without any fallbacks. Maybe once I have energy again to waste on this I'll look into it deeper.
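For what it's worth, terminals and some editors decide cell width from the Unicode East Asian Width property, which Python's stdlib exposes. This predicts character cells, not exact pixels, which is consistent with the slight mismatches described above:

```python
import unicodedata

# 'Na' = narrow, 'W' = wide (two cells), 'F' = fullwidth, 'H' = halfwidth
for ch in ("A", "あ", "ア", "Ａ", "ｱ"):
    print(repr(ch), unicodedata.east_asian_width(ch))
```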
Very true. Rust’s handling of strings was an eye opener for me.
Seemed awkward but I eventually realized I rarely cared about the number of characters. Even when dealing with substrings, I really only cared about a means to describe “stuff” before/after, not literal indices.
Counting Unicode characters is actually a disservice.
FWIW, I frequently want the string length. Not for anything complicated, but our authors have ranges of characters they are supposed to stay in. Luckily no one uses emojis or weird unicode symbols, so in practice there’s no problem getting the right number by simply ignoring all the complexities.
How about for iterating every character in a string in order to find a specific character combination? I need (or the iterator needs) to know the length of the string and what the boundaries of each character are.
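In high-level languages the search itself usually hides that bookkeeping, though. A Python sketch (substring operators handle code-point boundaries internally; this does not address grapheme boundaries):

```python
s = "abc\U0001F926def"       # 'abc' + facepalm emoji + 'def'

print("\U0001F926def" in s)  # True: no manual length tracking needed
print(s.find("def"))         # 4: an index counted in code points, not bytes
```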
What about implementing text algorithms like prefix search or a suffix tree to mention the simplest ones? Don't you need a string length at various points there?
I actually want string length. Just give me the length of a word. My human brain wants a human way to think about problems. While programming I never think about bytes.
I have never wanted any of the things you said. I have, on the other hand, always wanted the string length. I'm not saying that we shouldn't have methods like what you state, we should! But your statement that people don't actually want string length is untrue because it's overly broad.
I see where you're coming from, but I disagree on some specifics, especially regarding bytes.
Most people care about the length of a string in terms of the number of characters.
Treating it as a proxy for the number of bytes has been incorrect ever since UTF-8 became the norm (basically forever), at least if you're dealing with anything beyond ASCII (which you really should, since East Asian users alone number in the billions).
Same goes for the "string width".
Yes, Unicode scalar values can combine into a single glyph and cause discrepancies, as the article mentions, but that is a much rarer edge case than simply handling non-ASCII text.
I have wanted string length many times in production systems for language processing. And it is perfectly fine as long as whatever you are using is consistent. I rarely care how many bytes an emoji actually is unless I'm worried about extreme efficiency in storage, or how many monospace characters it uses unless I do very specific UI things. This blog is more of a cautionary tale of what can happen if you unconsciously mix standards, e.g. by using one in the backend and another in the frontend. But this is not a problem of string lengths per se; they are just one instance where modern implementations are all over the place.
Ok, we've put Man Facepalming with Light Skin Tone back up there. I failed to find a way to avoid it.
Is there a way to represent this string with escaped codepoints? It would be both amusing and in HN's plaintext spirit to do it that way in the title above, but my Unicode is weak.
There's an awful lot of text in here but I'm not seeing a coherent argument that Python's approach is the worst, despite the author's assertion. It especially makes no sense to me that counting the characters the implementation actually uses should be worse than counting UTF-16 code units, for an implementation that doesn't use surrogate pairs (and in fact only uses those code units to store out-of-band data via the "surrogateescape" error handler, or explicitly requested characters. N.B.: Lone surrogates are still valid characters, even though a sequence containing them is not a valid string.) JavaScript is compelled to count UTF-16 code units because it actually does use UTF-16. Python's flexible string representation is a space optimization; it still fundamentally represents strings as a sequence of characters, without using the surrogate-pair system.
> JavaScript is compelled to count UTF-16 code units because it actually does use UTF-16. Python's flexible string representation is a space optimization; it still fundamentally represents strings as a sequence of characters, without using the surrogate-pair system.
Python's flexible string system has nothing to do with this. Python could easily have had len() return the byte count, even the USV count, or other vastly more meaningful metrics than "5", whose unit is so disastrous I can't put a name to it. It's not bytes, it's not UTF-16 code units, it's not anything meaningful, and that's the problem. In particular, the USV count would have been made easy (O(1) easy!) by Python's flexible string representation.
You're handwaving it away in your writing by calling it a "character in the implementation", but what is a character? It's not a character in any sense a normal human would recognize — like a grapheme cluster — as I think if I asked a human "how many characters is <imagine this is man with skin tone face palming>?", they'd probably say "well, … IDK if it's really a character, but 1, I suppose?" …but "5" or "7"? Where do those even come from? An astute person might say "Oh, perhaps that takes more than one byte, is that its size in memory?" Nope. Again: "character in the implementation" is a meaningless concept. We've assigned words to a thing to make it sound meaningful, but that is like definitionally begging the question here.
The article both argues that the "real" length from a user perspective is Extended Grapheme Clusters - and makes a case against using it, because it requires you to store the entire character database and may also change from one Unicode version to the next.
Therefore, people should use codepoints for things like length limits or database indexes.
But wouldn't this just move the "cause breakage with new Unicode version" problem to a different layer?
If a newer Unicode version suddenly defines some sequences to be a single grapheme cluster where there were several ones before and my database index now suddenly points to the middle of that cluster, what would I do?
Seems to me, the bigger problem is with backwards compatibility guarantees in Unicode. If the standard is continuously updated and they feel they can just make arbitrary changes to how grapheme clusters work at any time, how is any software that's not "evergreen" (i.e. forces users onto the latest version and pretends older versions don't exist) supposed to deal with that?
What do you mean by "use codepoints for ... database indexes"? I feel like you are drawing conclusions that the essay does not propose or support. (It doesn't say that you should use codepoints for length limits.)
> If the standard is continuously updated and they feel they can just make arbitrary changes to how grapheme clusters work at any time, how is any software that's not "evergreen" (i.e. forces users onto the latest version and pretends older versions don't exist) supposed to deal with that?
Why would software need to have a permanent, durable mapping between a string and the number of grapheme clusters that it contains?
I’m guessing this got posted by one who saw my comment https://news.ycombinator.com/item?id=44976046 today, though coincidence is possible. (Previous mention of the URL was 7 months ago.)
Python does an exceptionally bad job. After dragging the community through a 15-year transition to Python 3 in order to "fix" Unicode, we ended up with support that's worse than in languages that simply treat strings as raw bytes.
Yeah I have no idea what is wrong with that. Python simply operates on arrays of codepoints, which are a stable representation that can be converted to a bunch of encodings including "proper" utf-8, as long as all codepoints are representable in that encoding. This also allows you to work with strings that contain arbitrary data falling outside of the unicode spectrum.
Python does it correctly and the results in that gist are expected. Characters are not grapheme clusters, and not every sequence of characters is valid. The ability to store unpaired surrogate characters is a feature: it would take extra time to validate this when it only really matters at encoding time. It also empowers the "surrogateescape" error handler, that in turn makes it possible to supply arbitrary bytes in command line arguments, even while providing strings to your program which make sense in the common case. (Not all sequences of bytes are valid UTF-8; the error handler maps the invalid bytes to invalid unpaired surrogates.) The same character counts are (correctly) observed in many other programming languages; there's nothing at all "exceptional" about Python's treatment.
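The surrogateescape round trip described here can be seen directly (stdlib only):

```python
raw = b"caf\xff"  # not valid UTF-8: the byte 0xFF can never occur in it

# The invalid byte becomes a lone surrogate instead of raising UnicodeDecodeError
s = raw.decode("utf-8", errors="surrogateescape")
print(s[-1] == "\udcff")  # True: the bad byte is preserved out-of-band

# Encoding with the same handler restores the original bytes exactly
print(s.encode("utf-8", errors="surrogateescape") == raw)  # True
```

A plain `s.encode("utf-8")` on that string would raise, since lone surrogates are not encodable; the handler is what closes the loop.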
It's not actually possible to "treat strings as raw bytes", because they contain more than 256 possible distinct symbols. They must be encoded; even if you assume an ecosystem-wide encoding, you are still using that encoding. But if you wish to work with raw sequences of bytes in Python, the `bytes` type is built-in and trivially created using a `b'...'` literal, or various other constructors. (There is also a mutable `bytearray` type.) These types now correctly behave as a sequence of byte (i.e., integer ranging 0..255 inclusive) values; when you index them, you get an integer. I have personal experience of these properties simplifying and clarifying my code.
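A minimal illustration of the indexing behaviour described:

```python
data = "é".encode("utf-8")   # b'\xc3\xa9': one code point, two bytes

print(len(data))             # 2
print(data[0])               # 195: indexing a bytes object yields an int
print(data[0:1])             # b'\xc3': slicing yields bytes
print(data.decode("utf-8"))  # back to the one-character string
```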
Unicode was fixed (no quotation marks), with the result that you now have clearly distinct types that honour the Zen of Python principle that "explicit is better than implicit", and no longer get `UnicodeDecodeError` from attempting an encoding operation or vice-versa. (This problem spawned an entire family of very popular and very confused Stack Overflow Q&As, each with probably countless unrecognized duplicates.) As an added bonus, the default encoding for source code files changed to UTF-8, which means in practical terms that you can actually use non-English characters in your code comments (and even identifier names, with restrictions) now and have it just work without declaring an encoding (since your text editor now almost certainly assumes that encoding in 2025). This also made it possible to easily read text files as text in any declared encoding, and get strings as a result, while also having universal newline mode work, and all without needing to reach for `io` or `codecs` standard libraries.
The community was not so much "dragged through a 15-year transition"; rather, some members of the community spent as long as 15 (really 13.5, unless you count people continuing to try to use 2.7 past the extended EOL) years refusing to adapt to what was a clear bugfix of the clearly broken prior behaviour.
Stuff like this makes me so glad that in my world strings are ALWAYS ASCII and one char is always one byte. Unicode simply doesn't exist and all string manipulation can be done with a straightforward for loop or whatever.
Dealing with wide strings sounds like hell to me. Right up there with timezones. I'm perfectly happy with plain C in the embedded world.
That English can be well represented with ASCII may have contributed to America becoming an early computing powerhouse. You could actually do things like processing, sorting, and case-insensitive comparisons on data like names and addresses very cheaply.
Worth giving Raku a shout out here... methods do what they say and you write what you mean. Really wish every other language would pinch the Str implementation from here, or at least the design.
$ raku
Welcome to Rakudo™ v2025.06.
Implementing the Raku® Programming Language v6.d.
Built on MoarVM version 2025.06.
[0] > " ".chars
1
[1] > " ".codes
5
[2] > " ".encode('UTF-8').bytes
17
[3] > " ".NFD.map(*.chr.uniname)
(FACE PALM EMOJI MODIFIER FITZPATRICK TYPE-3 ZERO WIDTH JOINER MALE SIGN VARIATION SELECTOR-16)
I haven't thought about this deeply, but it seems to me that the evolution of unicode has left it unparseable (into extended grapheme clusters, which I guess are "characters") in a forwards compatible way. If so, it seems like we need a new encoding which actually delimits these (just as utf-8 delimits code points). Then the original sender determines what is a grapheme, and if they don't know, who does?
I run one of the many online word counting tools (WordCounts.com) which also does character counts. I have noticed that even Google Docs doesn't seem to use grapheme counts and will produce larger than expected counts for strings of emoji.
If you want to see a more interesting case than emoji, check out Thai language. In Thai, vowels could appear before, after, above, below, or on many sides of the associated consonants.
Fascinating and annoying problem, indeed. In Java, the correct way to iterate over the characters (Unicode scalar values) of a string is to use the IntStream provided by String::codePoints (since Java 8), but I bet 99.9999% of the existing code uses 16-bit chars.
This does not fix the problem. The emoji consists of multiple Unicode characters (in turn represented 1:1 by the integer "code point" values). There is much more to it than the problem of surrogate pairs.
I'd disagree the number of unicode scalars is useless (in the case of python3), but it's a very interesting article nonetheless. Too bad unicode.org decided to break all the URLs in the table at the end.
Number of UTF-8 code units (17 in this case)
Number of UTF-16 code units (7 in this case)
Number of UTF-32 code units or Unicode scalar values (5 in this case)
Number of extended grapheme clusters (1 in this case)
We would not have this problem if we all agreed to return the number of bytes instead.
Edit: My mistake. There would still be inconsistency between different encodings. My point is, if we all decided to report the number of bytes the string uses instead of the number of printable characters, we would not have the inconsistency between languages.
>Number of extended grapheme clusters (1 in this case)
Only if you are using a new enough version of unicode. If you were using an older version it is more than 1. As new unicode updates come out, the number of grapheme clusters a string has can change.
The article nearly equivocates “Rather Useless” and “unambiguously the worst”. Python3 seems more coherent to me than the article's argument:
1. Python3 plainly distinguishes between a string and a sequence of bytes. The function `len`, as a built-in, gives the most straightforward count: for any set or sequence of items, it counts the number of these items.
2. For a sequence of bytes, it counts the number of bytes. Taking this face-palming half-pale male hodgepodge and encoding it according to UTF-8, we get 17 bytes. Thus `len("\U0001F926\U0001F3FC\u200D\u2642\uFE0F".encode(encoding = "utf-8")) == 17`.
3. After bytes, the most basic entities are Unicode code points. A Python3 string is a sequence of Unicode code points. So for a Python3 string, `len` should give the number of Unicode code points. Thus `len("\U0001F926\U0001F3FC\u200D\u2642\uFE0F") == 5`.
Anything more is and should be beyond the purview of the simple built-in `len`:
4. Grapheme clusters are complicated and nearly as arbitrary as code points, hence there are “legacy grapheme clusters” – the grapheme clusters of older Unicode versions, because they changed – and “tailored grapheme clusters”, which may be needed “for specific locales and other customizations”, and of course the default “extended grapheme clusters”, which are only “a best-effort approximation” to “what a typical user might think of as a “character”.” Cf. https://www.unicode.org/reports/tr29
Of course, there are very few use cases for knowing the number of code points, but are there really many more for the number (NB: the number) of grapheme clusters?
5. The space a sequence of code points will occupy on the screen: certainly useful but at least dependent on the typeface that will be used for rendering and hence certainly beyond the purview of a simple function.
Another little thing: The post mentions that tag sequences are only used for the flags of England, Scotland, and Wales. Those are the only ones that are standard (RGI), but because it's clear how the mechanism would work for other subnational entities, some systems support other ones, such as US state flags! I don't recommend using these if you want other people to be able to see them, but...
I really hate to rant on about this. But the gymnastics required to parse UTF-8 correctly are truly insane. Besides that, we now see issues such as invisible glyph injection attacks cropping up all over the place due to this crappy so-called "standard". Maybe we should just go back to the simplicity of ASCII until we can come up with something better?
xg15|6 months ago
If I do s.charAt(x) or s.codePointAt(x) or s.substring(x, y), I'd like to know which values for x and y are valid and which aren't.
bstsb|6 months ago
for context, the actual post features an emoji with multiple unicode codepoints in between the quotes
dang|6 months ago
It’s not wrong that " ".length == 7 (2019) - https://news.ycombinator.com/item?id=36159443 - June 2023 (303 comments)
String length functions for single emoji characters evaluate to greater than 1 - https://news.ycombinator.com/item?id=26591373 - March 2021 (127 comments)
String Lengths in Unicode - https://news.ycombinator.com/item?id=20914184 - Sept 2019 (140 comments)
kazinator|6 months ago
TXR Lisp:
(Trust me when I say that the emoji was there when I edited the comment.) The second value takes work; we have to go through the code points and add up their UTF-8 lengths. The coded length is not cached.
chrismorgan|6 months ago
• https://news.ycombinator.com/item?id=36159443 (June 2023, 280 points, 303 comments; title got reemojied!)
• https://news.ycombinator.com/item?id=26591373 (March 2021, 116 points, 127 comments)
• https://news.ycombinator.com/item?id=20914184 (September 2019, 230 points, 140 comments)
osener|6 months ago
Some other fun examples: https://gist.github.com/ozanmakes/0624e805a13d2cebedfc81ea84...
mid-kid|6 months ago
zahlman|6 months ago
Python does it correctly and the results in that gist are expected. Characters are not grapheme clusters, and not every sequence of characters is valid. The ability to store unpaired surrogate characters is a feature: it would take extra time to validate this when it only really matters at encoding time. It also empowers the "surrogateescape" error handler, that in turn makes it possible to supply arbitrary bytes in command line arguments, even while providing strings to your program which make sense in the common case. (Not all sequences of bytes are valid UTF-8; the error handler maps the invalid bytes to invalid unpaired surrogates.) The same character counts are (correctly) observed in many other programming languages; there's nothing at all "exceptional" about Python's treatment.
It's not actually possible to "treat strings as raw bytes", because they contain more than 256 possible distinct symbols. They must be encoded; even if you assume an ecosystem-wide encoding, you are still using that encoding. But if you wish to work with raw sequences of bytes in Python, the `bytes` type is built-in and trivially created using a `b'...'` literal, or various other constructors. (There is also a mutable `bytearray` type.) These types now correctly behave as a sequence of byte (i.e., integer ranging 0..255 inclusive) values; when you index them, you get an integer. I have personal experience of these properties simplifying and clarifying my code.
Unicode was fixed (no quotation marks), with the result that you now have clearly distinct types that honour the Zen of Python principle that "explicit is better than implicit", and no longer get `UnicodeDecodeError` from attempting an encoding operation or vice-versa. (This problem spawned an entire family of very popular and very confused Stack Overflow Q&As, each with probably countless unrecognized duplicates.) As an added bonus, the default encoding for source code files changed to UTF-8, which means in practical terms that you can actually use non-English characters in your code comments (and even identifier names, with restrictions) now and have it just work without declaring an encoding (since your text editor now almost certainly assumes that encoding in 2025). This also made it possible to easily read text files as text in any declared encoding, and get strings as a result, while also having universal newline mode work, and all without needing to reach for `io` or `codecs` standard libraries.
The community was not so much "dragged through a 15-year transition"; rather, some members of the community spent as long as 15 (really 13.5, unless you count people continuing to try to use 2.7 past the extended EOL) years refusing to adapt to what was a clear bugfix of the clearly broken prior behaviour.
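A minimal sketch of the str/bytes split described above, using the emoji string from the article (the escapes spell out the facepalm emoji with its skin-tone, ZWJ, gender, and variation-selector code points):

```python
s = "\U0001F926\U0001F3FC\u200D\u2642\uFE0F"  # facepalm emoji sequence

print(len(s))                      # 5: Python strings are sequences of code points
data = s.encode("utf-8")           # explicit encode: str -> bytes
print(len(data))                   # 17 bytes in UTF-8
print(data[0])                     # indexing bytes yields an int in 0..255
print(data.decode("utf-8") == s)   # explicit decode round-trips -> True
```

Note that mixing the two types raises `TypeError` immediately rather than silently producing mojibake, which is the "explicit is better than implicit" payoff.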
estimator7292|6 months ago
Dealing with wide strings sounds like hell to me. Right up there with timezones. I'm perfectly happy with plain C in the embedded world.
umajho|6 months ago
[^2]: https://caniuse.com/mdn-javascript_builtins_intl_segmenter_s...
jfoster|6 months ago
If you want to see a more interesting case than emoji, check out the Thai language. In Thai, vowels can appear before, after, above, or below the associated consonants.
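A small stdlib-only sketch of this, using the common example word "น้ำ" (water), where the tone mark is a separate combining code point attached to the consonant:

```python
import unicodedata

word = "\u0E19\u0E49\u0E33"  # Thai "nam" (water): consonant + tone mark + vowel
for ch in word:
    # Show each code point's name and canonical combining class
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)} (combining class {unicodedata.combining(ch)})")
print(len(word))  # 3 code points, though it renders as a single syllable block
```

The tone mark (MAI THO) has a nonzero combining class, i.e. it stacks on the preceding consonant rather than occupying its own cell.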
mrheosuper|6 months ago
- Number of UTF-8 code units (17 in this case)
- Number of UTF-16 code units (7 in this case)
- Number of UTF-32 code units or Unicode scalar values (5 in this case)
- Number of extended grapheme clusters (1 in this case)
We would not have this problem if we all agreed to return the number of bytes instead.
Edit: My mistake. There would still be inconsistency between different encodings. My point is: if we all reported the number of bytes a string uses instead of the number of printable characters, we would not have this inconsistency between languages.
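The first three counts from that list can be reproduced in Python with only the standard library (the grapheme-cluster count would need a third-party library, so it is omitted here):

```python
s = "\U0001F926\U0001F3FC\u200D\u2642\uFE0F"

print(len(s.encode("utf-8")))            # 17 UTF-8 code units (bytes)
print(len(s.encode("utf-16-le")) // 2)   # 7 UTF-16 code units (2 bytes each)
print(len(s.encode("utf-32-le")) // 4)   # 5 UTF-32 code units / scalar values
print(len(s))                            # also 5: Python counts code points
```

The UTF-16 count is higher than the UTF-32 count because the two emoji code points above U+FFFF each need a surrogate pair.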
curtisf|6 months ago
UTF-8 code units _are_ bytes, which is one of the things that makes UTF-8 very nice and why it has won
charcircuit|6 months ago
Only if you are using a new enough version of Unicode. With an older version it is more than 1. As new Unicode updates come out, the number of grapheme clusters in a string can change.
minebreaker|6 months ago
I don't understand. It depends on the encoding, doesn't it?
jibal|6 months ago
But that isn't the same across all languages, or even across all implementations of the same language.
Mlller|6 months ago
1. Python3 plainly distinguishes between a string and a sequence of bytes. The function `len`, as a built-in, gives the most straightforward count: for any set or sequence of items, it counts the number of these items.
2. For a sequence of bytes, it counts the number of bytes. Taking this face-palming half-pale male hodgepodge and encoding it according to UTF-8, we get 17 bytes. Thus `len("\U0001F926\U0001F3FC\u200D\u2642\uFE0F".encode(encoding = "utf-8")) == 17`.
3. After bytes, the most basic entities are Unicode code points. A Python3 string is a sequence of Unicode code points. So for a Python3 string, `len` should give the number of Unicode code points. Thus `len("\U0001F926\U0001F3FC\u200D\u2642\uFE0F") == 5`.
Anything more is and should be beyond the purview of the simple built-in `len`:
4. Grapheme clusters are complicated and nearly as arbitrary as code points, hence there are “legacy grapheme clusters” – the grapheme clusters of older Unicode versions, because they changed – and “tailored grapheme clusters”, which may be needed “for specific locales and other customizations”, and of course the default “extended grapheme clusters”, which are only “a best-effort approximation” to “what a typical user might think of as a “character”.” Cf. https://www.unicode.org/reports/tr29
Of course, there are very few use cases for knowing the number of code points, but are there really many more for the number (NB: the number) of grapheme clusters?
Anyway, the great module https://pypi.org/project/regex/ supports “Matching a single grapheme \X”. So:
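For instance (a sketch using that third-party `regex` module; the exact count depends on the Unicode version its data tables ship with):

```python
import regex  # third-party, pip install regex; not the stdlib `re` module

s = "\U0001F926\U0001F3FC\u200D\u2642\uFE0F"
clusters = regex.findall(r"\X", s)  # \X matches one extended grapheme cluster
print(len(clusters))  # 1 with a reasonably recent Unicode database
```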
5. The space a sequence of code points will occupy on the screen: certainly useful, but dependent at least on the typeface used for rendering, and hence certainly beyond the purview of a simple function.
danhau|6 months ago
Unicode definitely has its faults, but on the whole it's great. I'll take Unicode w/ UTF-8 any day over the mess of encodings we had before it.
Needless to say, Unicode is not a good fit for every scenario.
eru|6 months ago
UTF-8 is so complicated because it wants to be backwards compatible with ASCII.
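That backwards compatibility is easy to demonstrate (a minimal sketch): pure-ASCII text is byte-for-byte identical in UTF-8, while anything else expands into multi-byte sequences whose bytes all stay out of the ASCII range.

```python
print("abc".encode("utf-8") == b"abc")   # ASCII is unchanged -> True
print(list("\u00E9".encode("utf-8")))    # 'é' becomes two bytes: [195, 169]
```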