top | item 36233752


Taywee | 2 years ago

I don't think that's what the complaint is. The complaint is that "multibyte" is not necessarily UTF-8. You can't just blindly convert to multibyte assuming that it's UTF-8, because it might not be. You can't convert between two encodings by just going through "multibyte", because it might actually not support all characters you might need to support.

So it really is a deficiency in C. It's nearly useless to have a "multibyte" or "wide character" encoding when those can mean anything. Having conversion between UTF-8 and UTF-32 is useful. Having conversion between "implementation and platform dependent 'multibyte'" and "implementation and platform dependent 'wide character'" strings is nearly useless.
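To make the locale dependence concrete, here is a rough probe, not a proof: it feeds the UTF-8 bytes for U+20AC to `mbrtoc32` and checks whether the current multibyte encoding decodes them as that code point. The function name `multibyte_looks_like_utf8` is made up for illustration, and the check assumes the implementation defines `__STDC_UTF_32__` so that `char32_t` values are UTF-32.

```c
#include <stddef.h>
#include <string.h>
#include <wchar.h>
#include <uchar.h>

/* Hypothetical helper: returns 1 if the current locale's multibyte encoding
   decodes the UTF-8 bytes of U+20AC (EURO SIGN) to that code point, else 0.
   Assumes char32_t holds UTF-32 (__STDC_UTF_32__). A heuristic only. */
int multibyte_looks_like_utf8(void)
{
    static const char euro[] = "\xE2\x82\xAC";
    mbstate_t st;
    char32_t c = 0;

    memset(&st, 0, sizeof st);            /* start in the initial shift state */
    size_t r = mbrtoc32(&c, euro, 3, &st);

    /* (size_t)-1 or (size_t)-2 mean invalid or incomplete in this encoding. */
    return r == 3 && c == 0x20AC;
}
```

The answer can change after `setlocale(LC_ALL, "")`, which is the whole problem: the same bytes mean different things depending on the environment the program happens to run in.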


kps|2 years ago

C multibyte, I believe, was designed around ISO2022-style stateful code switching. It predates Unicode.

cryptonector|2 years ago

> You can't just blindly convert to multibyte assuming that it's UTF-8, because it might not be.

I mentioned that. TFA didn't say that specifically.

Taywee|2 years ago

It sort of did, but in a completely different place past the critique section:

> But, rather than using them and needing to praying to the heaven’s the internal Multibyte C Encoding is UTF-8 (like with the aforementioned wcrtomb -> mbrtoc8/16/32 style of conversions), we’ll just provide a direction conversion routine and cut out the wchar_t encoding/multibyte encoding middle man.

Not sure why it wasn't mentioned up top. When trying to convert between UTF-8 and UTF-16 without doing it myself or pulling in external dependencies, this was the most annoying thing that slapped me in the face. This is the problem that makes reliable charset conversions between specific encodings actually impossible using just the stdlib functions.
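For comparison, this is roughly what "doing it myself" looks like: a minimal, locale-independent UTF-8 to UTF-16 sketch. The names `utf8_decode` and `utf8_to_utf16` are my own, not from the standard library or from TFA, and the error handling (fail on the first malformed sequence) is just one possible choice.

```c
#include <stddef.h>
#include <stdint.h>

/* Decode one UTF-8 sequence from s (at most n bytes) into *cp.
   Returns the number of bytes consumed, or 0 on malformed input. */
size_t utf8_decode(const unsigned char *s, size_t n, uint32_t *cp)
{
    size_t len;
    uint32_t c, min;

    if (n == 0) return 0;
    if (s[0] < 0x80) { *cp = s[0]; return 1; }
    else if ((s[0] & 0xE0) == 0xC0) { len = 2; c = s[0] & 0x1F; min = 0x80; }
    else if ((s[0] & 0xF0) == 0xE0) { len = 3; c = s[0] & 0x0F; min = 0x800; }
    else if ((s[0] & 0xF8) == 0xF0) { len = 4; c = s[0] & 0x07; min = 0x10000; }
    else return 0;                        /* stray continuation or invalid lead byte */

    if (n < len) return 0;                /* truncated sequence */
    for (size_t i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80) return 0;
        c = (c << 6) | (s[i] & 0x3F);
    }
    /* Reject overlong forms, surrogates, and values past U+10FFFF. */
    if (c < min || c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF)) return 0;
    *cp = c;
    return len;
}

/* Convert n bytes of UTF-8 into UTF-16 code units in out (capacity cap).
   Returns the number of units written, or (size_t)-1 on malformed input
   or insufficient space. */
size_t utf8_to_utf16(const char *in, size_t n, uint16_t *out, size_t cap)
{
    const unsigned char *s = (const unsigned char *)in;
    size_t w = 0;

    while (n > 0) {
        uint32_t cp;
        size_t used = utf8_decode(s, n, &cp);
        if (used == 0) return (size_t)-1;
        if (cp < 0x10000) {
            if (w + 1 > cap) return (size_t)-1;
            out[w++] = (uint16_t)cp;
        } else {
            if (w + 2 > cap) return (size_t)-1;
            cp -= 0x10000;                /* split into a surrogate pair */
            out[w++] = (uint16_t)(0xD800 | (cp >> 10));
            out[w++] = (uint16_t)(0xDC00 | (cp & 0x3FF));
        }
        s += used;
        n -= used;
    }
    return w;
}
```

Nothing here touches `setlocale`, `wchar_t`, or the multibyte functions, so the result does not depend on the environment. It's not much code, but it's code everyone ends up rewriting because the stdlib has no equivalent that names both encodings explicitly.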