korlja | 3 years ago

.ToUpper() is locale-dependent, so it can only be used if the locale of the text in question is known. E.g. German ß capitalizes to SS, and .ToUpper().ToLower() should give you either 'ss' or 'ß' depending on what it was before. Always outputting 'ss' is okish and readable, but actually wrong.
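The roundtrip loss is easy to reproduce. A minimal sketch in Python rather than .NET: Python's str.upper() and str.lower() apply the locale-independent Unicode default mappings, so this is only an analogy to what .ToUpper() does, and the sample word is just an illustration.

```python
# Python's str.upper()/str.lower() use Unicode's default case mappings,
# not the system locale; .NET's ToUpper() may behave differently per culture.
original = "Fuß"
shouted = original.upper()    # 'ß' uppercases to the two letters 'SS'
roundtrip = shouted.lower()   # nothing records that 'SS' used to be 'ß'

print(shouted)                         # FUSS
print(roundtrip)                       # fuss
print(roundtrip == original.lower())   # False: the 'ß' did not survive
```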

Blindly calling .ToUpper() on anything is a typical anglo-centric mistake. Just don't use .ToUpper(), shoutcase is ugly anyways ;)

See also: one of the many "100 fallacies programmers assume about natural written language" documents or such.

AdrianoKF|3 years ago

Small nitpick: uppercase ẞ was added in Unicode 5.1, released in 2008 (https://unicode-table.com/en/1E9E/), and has been considered correct German orthography since 2017 (see §25 E3 in https://grammis.ids-mannheim.de/rechtschreibung/6180#par25E3).

usr1106|3 years ago

How often do you see the new letter in German everyday life? Although I'm German myself and don't visit Germany that often these days, I still read a couple of German publications regularly. I have never seen the new letter outside of discussions by software people about character handling.

korlja|3 years ago

That is correct and solves the roundtrip-problem (in this case and language). But uppercase 'ẞ' is just an additional option at the discretion of the writer, the recommended variant continues to be 'SS'.
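For what it's worth, the default case mappings seem to reflect that "SS stays the recommended variant" stance. A small Python sketch (assuming Python's built-in Unicode tables, other runtimes may differ): lowercasing ẞ gives ß, but uppercasing ß still gives SS, so the roundtrip only holds if something explicitly chooses ẞ.

```python
import unicodedata

capital_sharp_s = "\u1e9e"   # ẞ
small_sharp_s = "\u00df"     # ß

print(unicodedata.name(capital_sharp_s))   # LATIN CAPITAL LETTER SHARP S

lowered = capital_sharp_s.lower()    # ẞ -> ß
upper_again = lowered.upper()        # ß -> SS, not back to ẞ

print(lowered == small_sharp_s)          # True
print(upper_again)                       # SS
print(upper_again == capital_sharp_s)    # False
```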

egeozcan|3 years ago

> German ß capitalizes to SS, and .ToUpper().ToLower() should give you either 'ss' or 'ß' depending on what it was before

As long as there is no unicode SS character, we are into the "what color are your bits" problem or tolower needs to be language and word aware.

In .NET the uppercase and lowercase functions are culture-aware (defaulting to system settings, which breaks more software than you might think) but not word-aware AFAIK.

bee_rider|3 years ago

> As long as there is no unicode SS character, we are into the "what color are your bits" problem or tolower needs to be language and word aware.

It turns out there is such a unicode character -- ẞ/ß -- although based on other comments here it looks like it was added fairly recently.

Upper/Lower case stuff just seems to be at an annoying intersection where it has cultural and also programming significance. Or at least, people will use toUpper when they really want some case-insensitive sortable version of the string.

(Based on some googling, localeCompare is probably the way to go, in JavaScript at least.)
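As a rough Python analog of the "case-insensitive sortable version" idea, here is a sketch that sorts with str.casefold as the key instead of uppercasing the strings. The word list is made up for illustration, and this is not real locale-aware collation: accented letters still sort by code point, and the standard library's locale.strxfrm depends on which locales are installed on the system.

```python
words = ["banana", "apple", "Cherry"]

# Plain sort compares code points, so every uppercase ASCII letter
# sorts before every lowercase one.
plain = sorted(words)

# casefold() is used only as the sort key; the strings themselves
# are returned unchanged.
folded = sorted(words, key=str.casefold)

print(plain)    # ['Cherry', 'apple', 'banana']
print(folded)   # ['apple', 'banana', 'Cherry']
```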

3836293648|3 years ago

I hate the locale nonsense. The decimal point is `.` and not `,`. The rest of this stupid country is wrong

Hamuko|3 years ago

>Blindly calling .ToUpper() on anything is a typical anglo-centric mistake.

Yes, one that you might make if you were, for example, trying to make English text uppercase. Which is why it would be daft for anyone to suggest that their country has two different English spellings depending on the character case.

d1sxeyes|3 years ago

.toUpper() is a quick and mostly effective way to normalise strings for comparison if you're not sure what case the two strings to compare are in (eg: one has been input by a user). Yes, it's a shortcut, and occasionally you'll end up with a miss, but it's good enough to work 99% of the time, and the alternative is a LOT of code and data changes to handle a very small proportion of cases.
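One example of such a miss, sketched in Python: comparing via upper() fails when one side uses the capital ẞ, while case folding treats both spellings alike. The helper name same_ignoring_case is hypothetical, and normalizing to NFC before folding is a simplification of full Unicode caseless matching.

```python
import unicodedata

def same_ignoring_case(a: str, b: str) -> bool:
    """Hypothetical helper: compare two strings after NFC + casefold."""
    def key(s: str) -> str:
        # casefold() is the Unicode operation intended for caseless
        # matching; it maps both 'ß' and 'ẞ' to 'ss'.
        return unicodedata.normalize("NFC", s).casefold()
    return key(a) == key(b)

# upper() leaves 'ẞ' alone but turns 'ß' into 'SS', so the two differ.
print("STRAẞE".upper() == "straße".upper())     # False
print(same_ignoring_case("STRAẞE", "straße"))   # True
print(same_ignoring_case("STRASSE", "straße"))  # True
```

Whether 'STRASSE' and 'straße' should count as equal is a product decision; the sketch only shows that folding is more forgiving than uppercasing.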

vesinisa|3 years ago

Hmm I think you miss the point. In some programming environments (like C# and Java) .toUpper() is always incorrect in code unless you are displaying the resulting string in a UI, as it uses the "current locale", which is whatever the user has selected for the machine. When e.g. comparing strings case-insensitively, you should always explicitly specify the locale where the conversion should happen instead of relying on an external configuration variable.
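To illustrate "specify the locale explicitly", here is a toy Python sketch. Python's built-in upper() takes no locale at all, so the function upper_for and the SPECIAL_UPPER table are hypothetical stand-ins. Turkish dotted/dotless i is my own example (not from the comment above) of a case where the locale changes the result; real implementations would take such rules from CLDR/ICU data rather than a hand-written table.

```python
# Hypothetical per-language tailoring table, for illustration only.
SPECIAL_UPPER = {
    "tr": {"i": "İ", "ı": "I"},   # Turkish: dotted i -> İ, dotless ı -> I
}

def upper_for(text: str, lang: str) -> str:
    """Uppercase `text` for an explicitly named language (toy sketch)."""
    # Apply the language-specific replacements first, then fall back to
    # the locale-independent Unicode default mapping for everything else.
    table = str.maketrans(SPECIAL_UPPER.get(lang, {}))
    return text.translate(table).upper()

print(upper_for("title", "en"))   # TITLE
print(upper_for("title", "tr"))   # TİTLE
```

The point is that the language is a parameter of the call, not something read from the machine's configuration.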

JavaScript actually seems to be the smart one here - its default .toUpperCase() uses the "locale-insensitive case mappings in the Unicode Character Database".

underwater|3 years ago

You make a good case (ha!). What if toUpper() and toLower() were omitted from standard libraries? Usually they are used, incorrectly, to do something like string comparison, which could be better served by a more specific method.

bbu|3 years ago

Only sz should use ß. Ss stays ss even in German-german. Switzerland got rid of the sz/ss distinction a long time ago. So you need to be culture and word aware to do it „right“.

korlja|3 years ago

'sz' for 'ß' is sometimes used to make things roundtrip-proof in capslock, e.g. on military stencils. HTML calls it 'szlig'. Also, some use "Eszett" as the name of the character. But all are wrong in that ß isn't a ligature of s and z, it is a ligature of s and s. The shape of the character stems from the fact that in Fraktur writing and even some grotesk fonts, 's' at the end of a word was written 's', while 's' within a word was written 'ſ'. Thus the end of a word like Fuss was written Fuſs, giving a ligature of Fuß. No 'z' anywhere.