
OpenBSD removes support for non-UTF8 locales

236 points| ingve | 10 years ago |marc.info | reply

180 comments

[+] kragen|10 years ago|reply

I wonder what the pros and cons weighed in the discussion were.

Clearly not supporting Unicode text in non-UTF-8 locales (except through, like, some kind of compatibility function, like recode or iconv) is the Right Thing. One problem that I have is that current UTF-8 implementations typically are not "8 bit clean", in the sense that GNU and modern Unix tools typically attempt to be; they crash, usually by throwing an exception, if you feed them certain data, or worse, they silently corrupt it.

Markus Kuhn suggested "UTF-8B" as a solution to this problem some years ago. Quoting Eric Tiedemann's libutf8b blurb, "utf-8b is a mapping from byte streams to unicode codepoint streams that provides an exceptionally clean handling of garbage (i.e., non-utf-8) bytes (i.e., bytes that are not part of a utf-8 encoding) in the input stream. They are mapped to 256 different, guaranteed undefined, unicode codepoints." Eric's dead, but you can still get libutf8b from http://hyperreal.org/~est/libutf8b/.

[+] throwaway2048|10 years ago|reply

I'm willing to bet a large amount that the non-UTF-8 encodings were broken and nobody cared enough to bother fixing them.

OpenBSD does not hesitate to nuke legacy stuff that gets broken, which I feel is ultimately for the best, because half-assed support that barely functions is often worse than no support at all.

[+] geofft|10 years ago|reply

For the benefit of others (the link is nonobvious), here's Markus Kuhn's presentation of UTF-8B:

http://hyperreal.org/~est/utf-8b/releases/utf-8b-20060413043...

The tl;dr is to map an invalid UTF-8 byte n to code point U+DC00 + n, which puts it in the code point range reserved for the second part of a surrogate pair. (In UTF-16, a 16-bit value between D800 and DBFF followed by a 16-bit value between DC00 and DFFF is used to encode a code point that cannot fit in 16 bits. Since these "surrogate pairs" happen only in that order, there is room to extend UTF-16 by assigning a meaning to a DC00-DFFF value seen without a D800-DBFF before it.) Since the surrogate code points are defined as not "Unicode scalar values" and cannot exist in well-formed "Unicode text", and therefore cannot be decoded from well-formed UTF-8, there's no risk of confusion.

There are some similarities with the extension of UTF-8 encoding that is sometimes called "WTF-8" https://simonsapin.github.io/wtf-8/. WTF-8 lets unchecked purportedly-UTF-16 data be parsed as a sequence of code points, encoded into an extension of UTF-8, and round-tripped back into the original array of uint16s. UTF-8B lets unchecked purportedly-UTF-8 data be parsed as a sequence of code points, encoded into an extension of UTF-16, and round-tripped back into the original array of uint8s. They're not quite compatible, because WTF-8 would encode U+DC80 as a three-byte sequence (ED B2 80), and UTF-8B would decode that into three code points (U+DCED U+DCB2 U+DC80) since U+DC80 isn't a Unicode scalar value. But if a system wanted to support both of these robust encodings simultaneously, I think you could handle this fairly clear special case.
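The mismatch described above can be reproduced with a minimal Python sketch. It rests on two assumptions: that Python's built-in `surrogateescape` error handler (PEP 383) behaves like UTF-8B for this input, and that `surrogatepass` is a fair stand-in for a WTF-8 encoder. Neither is the reference implementation of either scheme.

```python
# Stand-in for WTF-8: encode a lone surrogate as if it were an ordinary code point.
wtf8_bytes = "\udc80".encode("utf-8", "surrogatepass")

# Stand-in for UTF-8B: every byte that is not part of valid UTF-8
# becomes the code point U+DC00 + byte.
utf8b_text = wtf8_bytes.decode("utf-8", "surrogateescape")

print(" ".join(f"{b:02X}" for b in wtf8_bytes))   # ED B2 80
print([f"U+{ord(c):04X}" for c in utf8b_text])    # ['U+DCED', 'U+DCB2', 'U+DC80']

# The bytes survive the round trip, but the original code point does not.
bytes_round_trip = utf8b_text.encode("utf-8", "surrogateescape") == wtf8_bytes
code_point_round_trip = utf8b_text == "\udc80"
print(bytes_round_trip, code_point_round_trip)    # True False
```

A system supporting both schemes would need to treat the ED B2 80 style sequences as the special case mentioned above.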

[+] ptx|10 years ago|reply

Kuhn's idea is also used in Python 3, so that garbage bytes can (optionally!) be decoded to Unicode strings and later losslessly turned back into the same bytes, which ensures (e.g.) that filenames that can't be decoded can still be used: https://www.python.org/dev/peps/pep-0383/
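A short sketch of that round trip, using the `surrogateescape` error handler that PEP 383 introduced. The filename bytes are made up for illustration; on POSIX systems `os.fsdecode` and `os.fsencode` apply the same handler, but the explicit codec calls below avoid depending on the local filesystem encoding.

```python
# A hypothetical filename in Latin-1; the 0xE9 bytes are not valid UTF-8.
raw_name = b"r\xe9sum\xe9.txt"

# Undecodable bytes become lone surrogates U+DC80..U+DCFF instead of raising.
text = raw_name.decode("utf-8", errors="surrogateescape")
print(ascii(text))  # 'r\udce9sum\udce9.txt'

# Encoding with the same handler restores the original bytes exactly.
restored = text.encode("utf-8", errors="surrogateescape")

# A strict encode refuses the escaped string, so the garbage cannot
# leak out as if it were well-formed UTF-8.
try:
    text.encode("utf-8")
    strict_encode_failed = False
except UnicodeEncodeError:
    strict_encode_failed = True
```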

[+] FroshKiller|10 years ago|reply

It was sort of darkly funny to be reading along as you're quoting the guy and then all of a sudden get hit, so matter-of-factly, with "He's dead, but you can still get the thing from...." A real splash of cold water.

[+] tedunangst|10 years ago|reply

In a roundabout way, this is because I wasn't able to push through an isprint() workaround diff to ls. http://marc.info/?l=openbsd-misc&m=142540203528315&w=2

[+] skybrian|10 years ago|reply

Reminds me of Go strings: they usually store UTF-8 but they're actually 8-bit clean:

"It's important to state right up front that a string holds arbitrary bytes. It is not required to hold Unicode text, UTF-8 text, or any other predefined format. As far as the content of a string is concerned, it is exactly equivalent to a slice of bytes."

https://blog.golang.org/strings

[+] masklinn|10 years ago|reply

> One problem that I have is that current UTF-8 implementations typically are not "8 bit clean", in the sense that GNU and modern Unix tools typically attempt to be; they crash, usually by throwing an exception

Crashing on invalid data sounds like a great idea. Letting garbage through doesn't.
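Both failure modes from the quoted complaint are easy to see in Python, taken here only as one example of a UTF-8 implementation. The input bytes are invented for illustration.

```python
data = b"caf\xe9s"  # Latin-1 "cafés"; 0xE9 is not valid UTF-8 here

# Mode 1: strict decoding stops with an exception.
try:
    data.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# Mode 2: lenient decoding substitutes U+FFFD, and the original byte is lost.
lossy = data.decode("utf-8", errors="replace")
round_tripped = lossy.encode("utf-8")

print(strict_failed)          # True
print(ascii(lossy))           # 'caf\ufffds'
print(round_tripped == data)  # False
```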

[+] gnuvince|10 years ago|reply

As a French-speaking person, I cannot tell you how much the announcement[0] that after 5.8, basic utilities, including mg(1), will be UTF-8 ready pleases me. I'm a huge Emacs fan, but I like to use mg(1) for quick edits and this is very exciting news for me!

[0] http://undeadly.org/cgi?action=article&sid=20150722182236

[+] david-given|10 years ago|reply

AIUI, one of the big problems with using UTF-8 universally is that it's rather unfriendly to Asian character sets. e.g. apparently UTF-8 is three times bigger than TIS-620 for Thai characters (from http://www.micro-isv.asia/2009/03/why-not-use-utf-8-for-ever...).
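The factor of three can be checked with a quick sketch, assuming Python's bundled `tis-620` codec and an arbitrary six-character Thai word. Text that mixes in ASCII (spaces, digits, markup) would show a smaller ratio.

```python
# "sawasdee" (hello), written with escapes so the source stays ASCII.
thai = "\u0e2a\u0e27\u0e31\u0e2a\u0e14\u0e35"

utf8_len = len(thai.encode("utf-8"))     # Thai code points take 3 bytes each
tis_len = len(thai.encode("tis-620"))    # TIS-620 uses 1 byte each

print(len(thai), utf8_len, tis_len)      # 6 18 6
```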

[+] busterarm|10 years ago|reply

Funny thing about Emacs and OpenBSD...

...Emacs is the only package in the entire ports tree that can't use ASLR.

[+] fletchowns|10 years ago|reply

I dream of a world where everything is UTC, UTF-8, and metric.

[+] nawitus|10 years ago|reply

UTC is (as everyone knows) a bit problematic due to leap seconds. Different software systems handle leap seconds somewhat differently. Handling leap seconds is actually quite difficult if you want to get it absolutely correct. In 99% of cases the problems are just ignored (e.g. "it doesn't matter if the chart is slightly odd looking when you look at the moment of the leap second").

There's also the problem that every piece of software using UTC must be updated at least once every six months. That may be a lesser problem these days, but is still somewhat relevant, especially in various industries.

I'd probably go with TAI and just convert the dates to the "human readable" format in the UI. Of course, that's not trivial either.
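A rough sketch of why the conversion is not trivial: going between UTC and TAI needs a leap-second table that has to be kept current. The table below is deliberately partial (three entries only), and the function names are hypothetical. Python's `datetime` also cannot represent 23:59:60, so the leap second itself is out of reach here.

```python
from bisect import bisect_right
from datetime import datetime, timedelta, timezone

# Partial table: (UTC instant from which the offset applies, TAI - UTC in seconds).
# A real system must ship the full table and update it when a new
# leap second is announced.
LEAP_TABLE = [
    (datetime(2012, 7, 1, tzinfo=timezone.utc), 35),
    (datetime(2015, 7, 1, tzinfo=timezone.utc), 36),
    (datetime(2017, 1, 1, tzinfo=timezone.utc), 37),
]
_STARTS = [start for start, _ in LEAP_TABLE]


def tai_minus_utc(utc):
    """Return TAI - UTC in seconds for an aware UTC datetime."""
    i = bisect_right(_STARTS, utc) - 1
    if i < 0:
        raise ValueError("instant predates this partial table")
    return LEAP_TABLE[i][1]


def utc_to_tai_reading(utc):
    """Return the TAI clock reading as a naive datetime (it is not UTC anymore)."""
    return (utc + timedelta(seconds=tai_minus_utc(utc))).replace(tzinfo=None)


stamp = datetime(2015, 8, 14, 12, 0, 0, tzinfo=timezone.utc)
print(utc_to_tai_reading(stamp))  # 2015-08-14 12:00:36
```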

[+] RexRollman|10 years ago|reply

Personally, I wish everyone used the 24-hour clock. Maybe the military has messed me up, but I really prefer seeing something like 18:22 over 6:22pm. It just seems simpler.

[+] toyg|10 years ago|reply

And ISO 8601. Sweet, sweet ISO 8601...

[+] ohitsdom|10 years ago|reply

I'd strongly consider voting for a US presidential candidate solely on whether moving to metric was a big part of their platform. So much time is wasted in school on the confusing mess that is the imperial system.

[+] protomyth|10 years ago|reply

I'd prefer to skip metric and redefine the foot as the distance light travels in a nanosecond ~ 11.8 inches. The length of the path travelled by light in vacuum during a time interval of 1/299792458 of a second seems a bit arbitrary to me.

I would also vote for any candidate that would ban anything but powers of 2 in the definition of computer storage.
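The arithmetic behind the "light-nanosecond foot" is short enough to check directly. Both constants are exact by definition; the variable names are just for illustration.

```python
C_METRES_PER_SECOND = 299_792_458   # speed of light in vacuum, exact
METRES_PER_INCH = 0.0254            # international inch, exact

# Distance light travels in one nanosecond, expressed in inches.
light_ns_metres = C_METRES_PER_SECOND * 1e-9
light_ns_inches = light_ns_metres / METRES_PER_INCH

print(round(light_ns_inches, 2))    # 11.8
```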

[+] specialist|10 years ago|reply

My dream also includes the Hanke-Henry calendar.

https://en.wikipedia.org/wiki/Hanke-Henry_Permanent_Calendar

My pick of the numerous proposals. Mostly because I can understand it.

https://en.wikipedia.org/wiki/Calendar_reform

[+] bro-stick|10 years ago|reply

It would be nice if graphics cards / VM BIOSes extended text video modes (perhaps new VESA modes) to support Unicode, preferably UTF-32 to make buffer offsets easy to compute. Sure, it means more fonts and not every codepoint can squeeze into a few pixels, but most additional codepoints will be good enough and better than seeing garbage on the screen.
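The offset argument can be sketched for a purely hypothetical 80x25 text buffer with one UTF-32 code unit per cell. No existing VESA mode is implied, and combining characters or double-width glyphs are ignored.

```python
COLS, ROWS = 80, 25        # hypothetical text-mode geometry
BYTES_PER_CELL = 4         # one UTF-32 code unit per cell

buf = bytearray(COLS * ROWS * BYTES_PER_CELL)


def cell_offset(row, col):
    """Byte offset of a cell; constant-time because every cell is 4 bytes."""
    return (row * COLS + col) * BYTES_PER_CELL


def put(row, col, ch):
    """Store a single code point in the given cell (little-endian, no BOM)."""
    off = cell_offset(row, col)
    buf[off:off + BYTES_PER_CELL] = ch.encode("utf-32-le")


def get(row, col):
    """Read back the code point stored in the given cell."""
    off = cell_offset(row, col)
    return buf[off:off + BYTES_PER_CELL].decode("utf-32-le")


put(1, 2, "\u00e9")
print(cell_offset(1, 2))   # 328
```

With UTF-8 in the buffer, finding the cell at a given column would instead require scanning the row from its start.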

[+] peterfirefly|10 years ago|reply

IPA would be nice, too.

[+] NoMoreNicksLeft|10 years ago|reply

Metric's nice, but each day being 10 hours in a 10 day week just doesn't work for me. But at least it's Fiveday, and I have a 20 hour weekend to look forward to.

[+] mythz|10 years ago|reply

And \n line delimiters and \s+ tabs!

[+] jlarocco|10 years ago|reply

Heh, I initially read it as "improves", and was wondering why they'd bother. Removing it is surprising, but makes sense.

[+] Animats|10 years ago|reply

How does locale work on the keyboard side, then? What determines whether text entry is right to left or left to right?

[+] ori_b|10 years ago|reply

Exactly the same as before -- your programs just expect UTF-8-encoded code points as input.

[+] khaled|10 years ago|reply

Keyboard layout is independent of the locale used, and so is the directionality of the text, which is a property of the characters themselves.
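That per-character property is the Unicode bidirectional class, which can be inspected with Python's `unicodedata` module. The sample characters are arbitrary picks; the last one is an explicit override control, the kind of "sequence of codes" that can force a direction.

```python
import unicodedata

samples = [
    ("a", "Latin letter"),
    ("\u05d0", "Hebrew alef"),
    ("\u0627", "Arabic alef"),
    ("1", "European digit"),
    ("\u202e", "RIGHT-TO-LEFT OVERRIDE control"),
]

# The class comes from the character itself, not from the locale.
bidi = {ch: unicodedata.bidirectional(ch) for ch, _ in samples}

for ch, label in samples:
    print(f"U+{ord(ch):04X} {label}: {bidi[ch]}")
# U+0061 Latin letter: L
# U+05D0 Hebrew alef: R
# U+0627 Arabic alef: AL
# U+0031 European digit: EN
# U+202E RIGHT-TO-LEFT OVERRIDE control: RLO
```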

[+] TazeTSchnitzel|10 years ago|reply

Input methods don't need to align with encoding.

[+] jandrese|10 years ago|reply

Presumably if you press the RTL key on your keyboard, or if you enter the sequence of codes that converts it from LTR to RTL.