robmccoll | 3 months ago

What's the reason for moving from ASCII CHAR to UTF-16 WCHAR rather than UTF-8 CHAR? I wouldn't think any parts of the codebase that don't need to render the string or worry about character counts would need to be modified.

Edit: https://devblogs.microsoft.com/oldnewthing/20190830-00/?p=10... seems to say the justification was that UTF-8 didn't exist yet? Not totally accurate, but it wasn't fully standardized. Also, that article seems to imply Windows 95 used UTF-16 (or UCS-2, but either way 16-bit chars), so I'm confused about porting code being a problem. Was it that the APIs in 95 were still kind of a halfway point?

ynik | 3 months ago

Windows NT started supporting Unicode before UTF-8 was invented, back when Unicode was fundamentally 16-bit. As a result, in the Microsoft world, WCHAR meant "supports Unicode" and CHAR meant "doesn't support Unicode yet".
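
To make the CHAR/WCHAR split concrete: every text-taking Win32 API exists in an "A" (CHAR) and a "W" (WCHAR) variant, and the undecorated name is just a macro that picks one. A minimal sketch (Windows-only; MessageBox chosen purely as a familiar example):

    #include <windows.h>

    int main(void) {
        /* "A" variant: 8-bit CHAR strings, interpreted in the current code page. */
        MessageBoxA(NULL, "8-bit CHAR text", "MessageBoxA", MB_OK);

        /* "W" variant: 16-bit WCHAR strings (UCS-2 originally, UTF-16 today). */
        MessageBoxW(NULL, L"16-bit WCHAR text", L"MessageBoxW", MB_OK);

        /* Plain MessageBox is a macro expanding to one of the above,
           depending on whether UNICODE is #defined at compile time. */
        return 0;
    }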

By the way, UTF-16 also didn't exist yet: Windows started with UCS-2. Though I think the name "UCS-2" also didn't exist yet -- AFAIK that name was only introduced in Unicode 2.0 together with UCS-4/UTF-32 and UTF-16 -- in Unicode 1.0, the 16-bit encoding was just called "Unicode", as there were no other encodings of Unicode.

usrnm | 3 months ago

> Windows NT started supporting unicode before UTF-8 was invented

That's not true; UTF-8 predates Windows NT. It's just that the jump from ASCII to UCS-2 (not even real UTF-16) was much easier and more natural, and at the time a lot of people really thought it would be enough. Java made the same mistake around the same time. I actually had the very same discussions with older die-hard Windows developers as late as 2015; for a lot of them, 2 bytes per symbol was still all you could possibly need.
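
A quick way to see where a fixed 2-byte character falls short: anything above U+FFFF has to be split into a UTF-16 surrogate pair, i.e. two 16-bit units for one character. A tiny sketch of that arithmetic (plain C, nothing Windows-specific):

    #include <stdio.h>
    #include <stdint.h>

    int main(void) {
        /* U+1F600 (an emoji) lies outside the BMP, so UCS-2 simply can't hold it. */
        uint32_t cp = 0x1F600;
        uint32_t v  = cp - 0x10000;
        uint16_t hi = (uint16_t)(0xD800 | (v >> 10));    /* high (lead) surrogate */
        uint16_t lo = (uint16_t)(0xDC00 | (v & 0x3FF));  /* low (trail) surrogate */
        printf("U+%X -> %04X %04X\n", (unsigned)cp, hi, lo);  /* prints D83D DE00 */
        return 0;
    }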

throwaway2037 | 3 months ago

Oh god, this again. One word: "History". No one thought we would need more than 16 bits (65k chars) to represent all the world's written languages. Then it happened. There must be no fewer than a thousand individually authored blog posts and technical articles on this matter. Win32, Java, and Qt all suffer from the same UTF-16 internal representation. There has been endless discussion over the last 10 years about how to change these frameworks to use a UTF-8 internal representation. It is a crazy hard problem.

ninkendo | 3 months ago

The tragic part is how brief the period was between “ASCII and a mess of code pages” and the problem actually getting solved with Unicode 2.0 and UTF-8.

Unicode 1.0 was in 1991, UTF-8 happened a year later, and Unicode 2.0 (where more than 65,536 characters became “official”, and UTF-8 was the recommended choice) was in 1996.

That means if you were green-fielding a new bit of tech in 1991, you likely decided 16 bits per character was the correct approach. But in 1992 it started to become clear that maybe a variable-width encoding (with 8 bits as the base character size) was on the horizon. And by 1996 it was clear that fixed 16-bit characters were a mistake.
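
For contrast, a rough sketch of what "variable-width with 8 bits as the base character size" means in practice: a minimal UTF-8 encoder where ASCII stays one byte and only rarer code points pay for more (illustrative only; no validation of surrogates or out-of-range values):

    #include <stdio.h>
    #include <stdint.h>

    /* Encode one code point as 1-4 UTF-8 bytes; returns the byte count. */
    static int utf8_encode(uint32_t cp, unsigned char out[4]) {
        if (cp < 0x80)    { out[0] = (unsigned char)cp; return 1; }  /* ASCII: 1 byte */
        if (cp < 0x800)   { out[0] = 0xC0 | (cp >> 6);
                            out[1] = 0x80 | (cp & 0x3F); return 2; }
        if (cp < 0x10000) { out[0] = 0xE0 | (cp >> 12);
                            out[1] = 0x80 | ((cp >> 6) & 0x3F);
                            out[2] = 0x80 | (cp & 0x3F); return 3; }
        out[0] = 0xF0 | (cp >> 18);         out[1] = 0x80 | ((cp >> 12) & 0x3F);
        out[2] = 0x80 | ((cp >> 6) & 0x3F); out[3] = 0x80 | (cp & 0x3F); return 4;
    }

    int main(void) {
        /* 'A', e-acute, the euro sign, and an emoji: 1, 2, 3, and 4 bytes. */
        uint32_t samples[] = { 0x41, 0xE9, 0x20AC, 0x1F600 };
        for (int i = 0; i < 4; i++) {
            unsigned char buf[4];
            int n = utf8_encode(samples[i], buf);
            printf("U+%X ->", (unsigned)samples[i]);
            for (int j = 0; j < n; j++) printf(" %02X", buf[j]);
            printf("\n");
        }
        return 0;
    }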

But that 5-year window was an extremely critical time in computing history: Windows NT was invented, as were Java, JavaScript, and a bunch of other things. By the time the mistake was clear it was too late; huge swaths of what would become today's technical landscape had already set the problem in stone.

UNIXes only ended up with the “right” technical choice because it was already too hard to move from ASCII to 16-bit characters… but that laziness in moving off of ASCII ultimately paid off once it became clear that 16 bits per character was the wrong choice in the first place. Otherwise UNIX would have had the same fate.