top | item 31466828

thrower123 | 3 years ago

Did I miss something in the Framework -> Core transition where strings aren't UTF-16 internally anymore?


bzxcvbn | 3 years ago

Objects of type `string` are internally encoded as UTF-16. Byte arrays or spans can contain anything, including strings encoded as UTF-8 (or whatever encoding you like, or raw binary files, or random bytes). The new feature does exactly what's written in the blog post: if a string literal contains only UTF-8 characters and you assign it to a byte array or span, it gets encoded as UTF-8. It's just syntactic sugar.

This is a post about C# 11 the language, not the framework or the runtime. It's only telling you what the compiler does when it encounters some syntax. Under the hood, it probably calls some encoders from the standard library.
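To make the sugar concrete, here is a minimal sketch (the `u8` suffix and `Encoding.UTF8.GetBytes` are real C# 11 / .NET APIs; the string chosen is just illustrative):

    using System;
    using System.Text;

    class Demo
    {
        static void Main()
        {
            // C# 11 UTF-8 string literal: the compiler emits the UTF-8 bytes
            // into the binary as static data, no runtime encoding needed.
            ReadOnlySpan<byte> viaLiteral = "héllo"u8;

            // Roughly equivalent result, but encodes at runtime on every call:
            byte[] viaEncoder = Encoding.UTF8.GetBytes("héllo");

            Console.WriteLine(viaLiteral.SequenceEqual(viaEncoder)); // True
        }
    }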

tialaramex | 3 years ago

> if a string literal contains only UTF-8 characters and you assign it to a byte array or span, it gets encoded as UTF-8.

I write a bunch of C# for my job, but am far from an expert in the language. My reading of this statement is that it's redundant, which makes me feel sure it's trying to communicate something the authors thought was "obvious" but is not.

* A string literal - so, realistically some Unicode text, right? All the other encodings anybody was actually using can transliterate to Unicode, so, they are just Unicode (with a different encoding)

* contains only UTF-8 characters - UTF-8 is an encoding of Unicode, so, this just means Unicode again

I'm guessing actually C# can write something that's not Unicode in a String for some reason? But what that might be is unexplained:

Can you... emit arbitrary bytes? But how when your native encoding (UTF-16) isn't even byte oriented? What does that mean?

Maybe you can emit the rare Unicode "non-characters" like U+FFFF? But, you can express those just fine in UTF-8 so who cares?

Or perhaps it's as simple as C# lets you write literals which are sequences of 16-bit code units but aren't valid UTF-16?
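(The last guess is in fact the mechanism at play: a C# string is a sequence of 16-bit code units, and a literal can contain an unpaired surrogate, which is not a valid Unicode scalar value and has no UTF-8 encoding. A minimal sketch, with the behavior of the default encoder shown as I understand it:)

    using System;
    using System.Text;

    class SurrogateDemo
    {
        static void Main()
        {
            // "\uD800" is a lone high surrogate: a legal char in a C# string,
            // but not a valid Unicode scalar value, so it has no UTF-8 form.
            string lone = "\uD800";

            // The default UTF-8 encoder substitutes U+FFFD rather than throw.
            byte[] bytes = Encoding.UTF8.GetBytes(lone);
            Console.WriteLine(BitConverter.ToString(bytes)); // EF-BF-BD

            // A u8 literal containing it is rejected at compile time:
            // ReadOnlySpan<byte> s = "\uD800"u8;  // compile-time error
        }
    }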

masklinn | 3 years ago

No, the internal storage is still UTF-16.

What the "UTF-8 literals" feature is really about is the conversion of strings to byte[] (and similar).

And more specifically the static initialisation of byte[] (and ReadOnlySpan<byte>): initialising one from a UTF-8 string literal stores the UTF-8 encoding of the literal as static data in the binary. Before, you had to write out the individual byte values by hand, e.g.

    byte[] thing = new byte[] { 0x77, 0x6f, 0x72, 0x6c, 0x64 };
can now be written

    byte[] thing = "world"u8.ToArray();

(As shipped in C# 11, the literal needs the u8 suffix and produces a ReadOnlySpan<byte>, hence the ToArray() to get a byte[]; binding the span directly avoids even that allocation.)

juki | 3 years ago

No, they're still UTF-16.
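The internal UTF-16 representation is easy to observe: string.Length counts 16-bit code units, so a character outside the Basic Multilingual Plane counts as two. A quick illustration (the G clef character is just an arbitrary non-BMP example):

    using System;
    using System.Text;

    class LengthDemo
    {
        static void Main()
        {
            string clef = "\U0001D11E"; // MUSICAL SYMBOL G CLEF, outside the BMP

            Console.WriteLine(clef.Length);                         // 2 (UTF-16 code units)
            Console.WriteLine(Encoding.UTF8.GetByteCount(clef));    // 4
            Console.WriteLine(Encoding.Unicode.GetByteCount(clef)); // 4 (two 16-bit units)
        }
    }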