top | item 29304169

tn13 | 4 years ago

The English-speaking world developed its intuitions about strings from ASCII, and those intuitions simply fail when applied to Unicode. That explains a lot of these pitfalls.

String length under definition #2 is also fairly complex in some languages, such as Hindi. Hindi has symbols that are not characters in their own right and can never stand alone, but that form a new character when placed next to another character. To type one you have to hit two keys, yet only one character appears on screen. Unicode likewise represents it as two separate code points, but to the human eye it is one character.

त् + या = त्या

The following code prints 4:

console.log("त्या".length);
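To make the three common notions of "length" concrete, here is a minimal sketch that measures the same string in UTF-16 code units, Unicode code points, and UTF-8 bytes. The string is written with escapes so the code points are unambiguous; the variable names are mine, not from the article.

```javascript
// "त्या" spelled out with escapes:
// U+0924 (ta), U+094D (virama), U+092F (ya), U+093E (vowel sign aa).
const hindi = "\u0924\u094D\u092F\u093E";

const utf16Units = hindi.length;                          // UTF-16 code units
const codePoints = [...hindi].length;                     // Unicode code points
const utf8Bytes = new TextEncoder().encode(hindi).length; // UTF-8 bytes

console.log(utf16Units, codePoints, utf8Bytes); // 4 4 12
```

None of these three numbers matches what a reader sees on screen, which is the whole problem.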

DemocracyFTW|4 years ago

"symbols in Hindi which are not characters and can never exist as their own character but when placed next to a character they create a new character"

a.k.a. 'ligatures', as in f+f+i -> U+FB03 'ﬃ'

nisegami|4 years ago

I would consider ligatures a text rendering concept, which allows for but is distinct from the linguistic concept described by GP.

Edit: to further illustrate my point, in the ligatures I'm familiar with (including the ones in your link), the component characters exist standalone and can be used on their own, unlike GP's example.
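A small sketch of that distinction, assuming the ligature in question is the precomposed U+FB03: it is a single code point that compatibility normalization breaks back into three letters, each of which exists on its own.

```javascript
const ligature = "\uFB03"; // LATIN SMALL LIGATURE FFI
const expanded = ligature.normalize("NFKC"); // compatibility decomposition

console.log(ligature.length); // 1
console.log(expanded);        // "ffi"
console.log(expanded.length); // 3
```

The Hindi example goes the other way: there is no single precomposed code point, only a sequence that renders as one unit.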

signal11|4 years ago

Swift handles this really well:

"त्या".count // 1

"त्या".unicodeScalars.count // 4

"त्या".utf8.count // 12

JavaScript's minimal standard library is of course not great, but there are libraries that can help, e.g. grapheme-splitter. It is not language-aware by design, though, so in this instance it returns 2.

graphemeSplitter.countGraphemes("त्या") // 2
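For comparison, newer runtimes ship a built-in grapheme segmenter, Intl.Segmenter. The sketch below assumes a runtime that has it (recent Node and browsers do); countGraphemes here is a hypothetical helper of mine, not the grapheme-splitter function above. I have left out an expected value for the Hindi string on purpose: as far as I know the answer depends on the Unicode version behind the runtime, with older segmentation rules breaking after the virama and newer ones keeping the conjunct together.

```javascript
// Hypothetical helper: count extended grapheme clusters with Intl.Segmenter.
function countGraphemes(text) {
  const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
  let count = 0;
  for (const _segment of segmenter.segment(text)) {
    count += 1;
  }
  return count;
}

const conjunct = "\u0924\u094D\u092F\u093E"; // "त्या"
console.log(countGraphemes(conjunct));   // 1 or 2, depending on the runtime's Unicode data
console.log(countGraphemes("e\u0301")); // 1: "e" followed by a combining acute accent
```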

professoretc|4 years ago

We already had something like this even in pure ASCII: "a\bc" has "length" 3 but appears as one glyph when printed (assuming your terminal interprets backspace).
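In JavaScript terms, as a quick sketch (the variable name is mine):

```javascript
const withBackspace = "a\bc"; // "a", ASCII backspace (0x08), "c"

console.log(withBackspace.length);        // 3
console.log(withBackspace.charCodeAt(1)); // 8
```

What a terminal actually draws for this string is up to the terminal, so the code only checks the stored length.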

int_19h|4 years ago

This made me think of Hangul, when not using the precomposed block. What's the string length of 한글?

raiph|4 years ago

I just tried it in the Rakudo compiler for Raku: the "chars" count, using the default EGC (extended grapheme cluster) counting, is 2.
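The same question in JavaScript, as a rough sketch: the precomposed form and the decomposed (conjoining jamo) form differ in raw length but agree on grapheme count. This assumes a runtime with Intl.Segmenter; graphemeCount is a hypothetical helper of mine.

```javascript
// Hypothetical helper: count extended grapheme clusters.
function graphemeCount(text) {
  const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
  return [...segmenter.segment(text)].length;
}

const precomposed = "\uD55C\uAE00";              // "한글" as two precomposed syllables
const decomposed = precomposed.normalize("NFD"); // same text as conjoining jamo

console.log(precomposed.length); // 2
console.log(decomposed.length);  // 6
console.log(graphemeCount(precomposed), graphemeCount(decomposed)); // 2 2
console.log(decomposed.normalize("NFC") === precomposed);           // true
```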