The issue there is that the Windows kernel does not validate Unicode. Linux has the same essential problem, arguably more so, since it does not specify an encoding at all.
They're essentially identical problems. On Linux these strings are arbitrary bytes with a rule forbidding byte 0x00; on Windows they're arbitrary unsigned 16-bit integers (I don't know if 0x0000 is forbidden). Not a significant difference.
Rust's OsString provides a structure that has the potentially nonsensical raw data inside it, and offers ways to ask for:
* The Unicode text if that's what is actually encoded (or else None)
* The text after applying Unicode decoding rules and substituting U+FFFD (Unicode's "Replacement Character" �) where errors occur
* The actual raw bytes / 16-bit unsigned integers
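For illustration, here's a rough sketch of those three views using the standard library. The helper names (`strict`, `lossy`, `raw`, `invalid_example`) are made up for this example; the calls underneath (`to_str`, `to_string_lossy`, `as_bytes`, `encode_wide`) are the std ones. The invalid sample string is built Unix-only, since that is the easy place to construct one.

```rust
use std::borrow::Cow;
use std::ffi::{OsStr, OsString};

/// Strict view: Some(text) only if the OS string is valid Unicode.
fn strict(s: &OsStr) -> Option<&str> {
    s.to_str()
}

/// Lossy view: anything undecodable is replaced with U+FFFD.
fn lossy(s: &OsStr) -> Cow<'_, str> {
    s.to_string_lossy()
}

/// Raw view on Unix: the bytes as stored in the OS string.
#[cfg(unix)]
fn raw(s: &OsStr) -> Vec<u8> {
    use std::os::unix::ffi::OsStrExt;
    s.as_bytes().to_vec()
}

/// Raw view on Windows: the 16-bit code units.
#[cfg(windows)]
fn raw(s: &OsStr) -> Vec<u16> {
    use std::os::windows::ffi::OsStrExt;
    s.encode_wide().collect()
}

/// A deliberately invalid OS string for demonstration (hypothetical helper, Unix only).
#[cfg(unix)]
fn invalid_example() -> OsString {
    use std::os::unix::ffi::OsStringExt;
    // "fo", then 0xFF (never valid in UTF-8), then "o".
    OsString::from_vec(vec![b'f', b'o', 0xFF, b'o'])
}
```

On a Unix system, `strict(&invalid_example())` is `None`, `lossy` gives `"fo\u{FFFD}o"`, and `raw` hands back the original four bytes untouched.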
If all your program cares about is a filename, then, not coincidentally, this is also how these operating systems spell filenames. You can just hand the OsString to an OS-level file API to open it, rename it, or whatever, without caring whether it is Unicode.
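A minimal sketch of that pass-through, assuming the name arrives as an OS string. `read_named` is a hypothetical helper, not anything from std; the point is only that `File::open` accepts the `OsStr` directly, so the name is never converted to `str`.

```rust
use std::ffi::OsStr;
use std::fs::File;
use std::io::{self, Read};

/// Read the whole file named by an OS string.
/// The name is passed to the file API as-is, with no Unicode check.
fn read_named(name: &OsStr) -> io::Result<Vec<u8>> {
    let mut buf = Vec::new();
    File::open(name)?.read_to_end(&mut buf)?;
    Ok(buf)
}
```

In a real program the name would typically come from `std::env::args_os()`, which yields `OsString` values rather than `String`.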
The U+FFFD option does expose one potential surprise on Unix-flavoured systems that doesn't exist on Windows: U+FFFD is one UTF-16 code unit but three UTF-8 code units. Such a "decoded" input could therefore exceed the size of some internal storage you'd assumed would never need to be larger than the maximum command-line length.
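To put numbers on that growth, here's a small sketch. `lossy_len` and `worst_case_lossy_len` are hypothetical names for this example; `String::from_utf8_lossy` is the std function doing the decoding. The 3x bound assumes the worst case, where every input byte is invalid on its own and so each becomes a separate U+FFFD.

```rust
/// Length in bytes of the UTF-8 text produced by lossy decoding.
fn lossy_len(raw: &[u8]) -> usize {
    String::from_utf8_lossy(raw).len()
}

/// Upper bound on that length: each invalid byte can turn into U+FFFD,
/// which takes three bytes in UTF-8, so budget three times the input size.
fn worst_case_lossy_len(raw_len: usize) -> usize {
    raw_len.saturating_mul(3)
}
```

For instance, four 0xFF bytes decode to four replacement characters occupying twelve bytes, while four ASCII bytes stay at four. So a buffer sized to the raw command-line limit is not safe for the lossily decoded text.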
dgellow|4 years ago
https://docs.microsoft.com/en-us/windows/apps/design/globali...
That's just for the application though; the operating system itself is still using UTF-16.
Edit: ah, never mind, that's mentioned by the author in the article.
ChrisSD|4 years ago
tialaramex|4 years ago