iqanq | 4 years ago

The Wild West of Trying to Shove UTF-8 onto a UTF-16 Operating System

dgellow|4 years ago

Since 2019, Windows has supported a UTF-8 code page. You can enable it via the application manifest.

https://docs.microsoft.com/en-us/windows/apps/design/globali...

That's just for the application though; the operating system itself still uses UTF-16.

Edit: ah, never mind, that's mentioned by the author in the article.

ChrisSD|4 years ago

The issue there is that the Windows kernel does not validate Unicode. Linux has the same essential problem, arguably more so, in that it does not specify an encoding at all.

tialaramex|4 years ago

They're essentially identical problems. In Linux these strings are arbitrary bytes with a rule forbidding byte 0x00; in Windows they're arbitrary unsigned 16-bit integers (I don't know if 0x0000 is forbidden). Not a significant difference.

Rust's OsString provides a structure that has the potentially nonsensical raw data inside it, and offers ways to ask for:

* The Unicode text if that's what is actually encoded (or else None)

* The text after applying Unicode decoding rules and substituting U+FFFD (Unicode's "Replacement Character" �) where errors occur

* The actual raw bytes / 16-bit unsigned integers

If all your program cares about is a filename, then (not coincidentally) this is also how these operating systems spell filenames, so you can just hand the OsString to an OS-level file API to open it, rename it, or whatever, without caring whether it is Unicode or not.
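To make the three options concrete, here is a minimal sketch, assuming a Unix target (the `std::os::unix::ffi` extension traits do not exist on Windows, where you would go through `OsStringExt::from_wide` and `OsStrExt::encode_wide` instead). The names `views` and `invalid_example` are mine, purely for illustration.

```rust
use std::ffi::OsString;
use std::os::unix::ffi::{OsStrExt, OsStringExt};

/// Returns the three views of an OS string:
/// strict Unicode (or None), lossy Unicode, and the raw bytes.
fn views(s: &OsString) -> (Option<String>, String, Vec<u8>) {
    // 1. The Unicode text, only if the contents really are valid UTF-8.
    let strict = s.to_str().map(str::to_owned);
    // 2. The text with U+FFFD substituted wherever decoding fails.
    let lossy = s.to_string_lossy().into_owned();
    // 3. The actual raw bytes (Unix only; Windows exposes u16 units instead).
    let raw = s.as_bytes().to_vec();
    (strict, lossy, raw)
}

/// Hypothetical "filename" containing a byte (0xFF) that can never
/// appear in valid UTF-8.
fn invalid_example() -> OsString {
    OsString::from_vec(vec![b'f', b'o', 0xFF, b'o'])
}
```

For `invalid_example()` the strict view should be `None`, the lossy view `"fo\u{FFFD}o"`, and the raw view the original four bytes. The same `OsString` can still be passed as-is to something like `std::fs::File::open`, which is the point of keeping the raw data around.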

The U+FFFD option does expose one potential surprise on Unix-flavoured systems that doesn't exist on Windows: U+FFFD is one UTF-16 code unit but three UTF-8 code units, so a single invalid byte can grow to three bytes when it is replaced. Such a "decoded" input could therefore exceed the size of some internal storage you'd assumed would never need to be larger than the command line maximum length.
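A rough sketch of that size difference, using only the standard library's lossy decoders rather than `OsString` itself so it runs on any platform. The function names are hypothetical, and the inputs are deliberately worst-case (every code unit invalid).

```rust
/// Length in bytes (UTF-8 code units) after lossily decoding raw bytes.
fn utf8_lossy_len(raw: &[u8]) -> usize {
    // Each invalid sequence becomes U+FFFD, which is 3 bytes in UTF-8.
    String::from_utf8_lossy(raw).len()
}

/// Length in UTF-16 code units after lossily decoding raw 16-bit units.
fn utf16_lossy_units(raw: &[u16]) -> usize {
    // Each unpaired surrogate becomes U+FFFD, which is 1 unit in UTF-16.
    String::from_utf16_lossy(raw).encode_utf16().count()
}
```

With four 0xFF bytes as input, `utf8_lossy_len` should report 12, three times the original length, while four unpaired 0xD800 surrogates passed to `utf16_lossy_units` should still come out as 4 units. So a buffer sized to the maximum command line length is only safe for the UTF-8 case if you allow for up to 3x growth.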