I wonder what the pros and cons weighed in the discussion were.
Clearly not supporting Unicode text in non-UTF-8 locales (except through, like, some kind of compatibility function, like recode or iconv) is the Right Thing. One problem that I have is that current UTF-8 implementations typically are not "8 bit clean", in the sense that GNU and modern Unix tools typically attempt to be; they crash, usually by throwing an exception, if you feed them certain data, or worse, they silently corrupt it.
Markus Kuhn suggested "UTF-8B" as a solution to this problem some years ago. Quoting Eric Tiedemann's libutf8b blurb, "utf-8b is a mapping from byte streams to unicode codepoint streams that provides an exceptionally clean handling of garbage (i.e., non-utf-8) bytes (i.e., bytes that are not part of a utf-8 encoding) in the input stream. They are mapped to 256 different, guaranteed undefined, unicode codepoints." Eric's dead, but you can still get libutf8b from http://hyperreal.org/~est/libutf8b/.
I'm willing to bet a large amount that non-UTF-8 encodings were broken and nobody cared enough to bother fixing them.
OpenBSD does not hesitate to nuke legacy stuff that gets broken, which I feel is ultimately for the best: half-assed support that barely functions is often worse than no support at all.
The tl;dr is to map an invalid UTF-8 byte n to code point U+DC00 + n, which puts it in the code point range reserved for the second part of a surrogate pair. (In UTF-16, a 16-bit value between D800 and DBFF followed by a 16-bit value between DC00 and DFFF is used to encode a code point that cannot fit in 16 bits. Since these "surrogate pairs" happen only in that order, there is room to extend UTF-16 by assigning a meaning to a DC00-DFFF value seen without a D800-DBFF before it.) Since the surrogate code points are defined as not "Unicode scalar values" and cannot exist in well-formed "Unicode text", and therefore cannot be decoded from well-formed UTF-8, there's no risk of confusion.
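The surrogate-pair arithmetic is compact enough to sketch (a hypothetical illustration, not from the thread):

```python
# UTF-16 surrogate-pair arithmetic for a code point that doesn't fit in 16 bits.
cp = 0x1F600                  # a code point above U+FFFF
v = cp - 0x10000              # 20-bit value to split across the pair
hi = 0xD800 + (v >> 10)       # lead surrogate, in the D800-DBFF range
lo = 0xDC00 + (v & 0x3FF)     # trail surrogate, in the DC00-DFFF range
assert (hi, lo) == (0xD83D, 0xDE00)

# UTF-8B borrows the trail range: an invalid input byte n maps to U+DC00 + n.
bad_byte = 0xFF
assert 0xDC00 + bad_byte == 0xDCFF
```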
There are some similarities with the extension of UTF-8 encoding that is sometimes called "WTF-8" https://simonsapin.github.io/wtf-8/. WTF-8 lets unchecked purportedly-UTF-16 data be parsed as a sequence of code points, encoded into an extension of UTF-8, and round-tripped back into the original array of uint16s. UTF-8B lets unchecked purportedly-UTF-8 data be parsed as a sequence of code points, encoded into an extension of UTF-16, and round-tripped back into the original array of uint8s. They're not quite compatible, because WTF-8 would encode U+DC80 as a three-byte sequence (ED B2 80), and UTF-8B would decode that into three code points (U+DCED U+DCB2 U+DC80) since U+DC80 isn't a Unicode scalar value. But if a system wanted to support both of these robust encodings simultaneously, I think you could handle this fairly clear special case.
Kuhn's idea is also used in Python 3, so that garbage bytes can (optionally!) be decoded to Unicode strings and later losslessly turned back into the same bytes, which ensures (e.g.) that filenames that can't be decoded can still be used:
https://www.python.org/dev/peps/pep-0383/
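Python's "surrogateescape" error handler from that PEP implements exactly this mapping, so the round-trip is easy to demonstrate:

```python
data = b"abc\xff\xfe"         # \xff and \xfe are not valid UTF-8

# An invalid byte n decodes to the lone surrogate U+DC00 + n...
s = data.decode("utf-8", errors="surrogateescape")
assert s == "abc\udcff\udcfe"

# ...and re-encoding restores the original bytes exactly.
assert s.encode("utf-8", errors="surrogateescape") == data

# The WTF-8 incompatibility mentioned above: the bytes ED B2 80 (WTF-8's
# encoding of U+DC80) are rejected by a strict UTF-8 decoder, so each byte
# is escaped separately, just as UTF-8B specifies.
assert b"\xed\xb2\x80".decode("utf-8", errors="surrogateescape") == "\udced\udcb2\udc80"
```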
It was sort of darkly funny to be reading along as you're quoting the guy then all of a sudden hit so matter-of-factly, "He's dead, but you can still get the thing from...." A real splash of cold water.
Reminds me of Go strings: they usually store UTF-8 but they're actually 8-bit clean:
"It's important to state right up front that a string holds arbitrary bytes. It is not required to hold Unicode text, UTF-8 text, or any other predefined format. As far as the content of a string is concerned, it is exactly equivalent to a slice of bytes."
https://blog.golang.org/strings
> One problem that I have is that current UTF-8 implementations typically are not "8 bit clean", in the sense that GNU and modern Unix tools typically attempt to be; they crash, usually by throwing an exception
Crashing on invalid data sounds like a great idea. Letting garbage through doesn't.
As a French-speaking person, I cannot tell you how much the announcement[0] that after 5.8, basic utilities, including mg(1), will be UTF-8 ready pleases me. I'm a huge Emacs fan, but I like to use mg(1) for quick edits and this is very exciting news for me!
[0] http://undeadly.org/cgi?action=article&sid=20150722182236

...Emacs is the only package in the entire ports tree that can't use ASLR.
AIUI, one of the big problems with using UTF-8 universally is that it's rather unfriendly to Asian character sets. e.g. apparently UTF-8 is three times bigger than TIS-620 for Thai characters (from http://www.micro-isv.asia/2009/03/why-not-use-utf-8-for-ever...).
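For illustration (the sample string is mine, not from the linked article), Python ships a TIS-620 codec, so the size difference is easy to check:

```python
thai = "ภาษาไทย"              # "Thai language": seven Thai characters

utf8 = thai.encode("utf-8")
tis620 = thai.encode("tis-620")

assert len(utf8) == 21        # 3 bytes per character in UTF-8
assert len(tis620) == 7       # 1 byte per character in TIS-620
```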
UTC is (as everyone knows) a bit problematic due to leap seconds. Different software systems handle leap seconds somewhat differently, and handling them is actually quite difficult if you want to get it absolutely correct. In 99% of cases the problems are just ignored (e.g. "it doesn't matter if the chart looks slightly odd at the moment of the leap second").
There's also the problem that any software tracking UTC must be updated at least once every six months, since that's how far ahead leap seconds are announced. That may be a lesser problem these days, but it is still somewhat relevant, especially in various industries.
I'd probably go with TAI and just convert the dates to the "human readable" format in the UI. Of course, that's not trivial either.
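A minimal sketch of what that conversion involves, the hard part being that the leap-second table itself must be kept current (the table excerpt and function names are mine):

```python
import datetime

# Illustrative excerpt of the leap-second table: (UTC instant the offset
# takes effect, TAI-UTC in seconds). A real system loads the full IERS table.
LEAP_TABLE = [
    (datetime.datetime(2012, 7, 1), 35),
    (datetime.datetime(2015, 7, 1), 36),
]

def tai_minus_utc(when):
    """TAI-UTC offset, in seconds, in effect at a given UTC datetime."""
    offset = 34  # value in force before the first entry above
    for since, off in LEAP_TABLE:
        if when >= since:
            offset = off
    return offset

def utc_to_tai(when):
    return when + datetime.timedelta(seconds=tai_minus_utc(when))

assert utc_to_tai(datetime.datetime(2015, 8, 1)) == datetime.datetime(2015, 8, 1, 0, 0, 36)
```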
Personally, I wish everyone used the 24-hour clock. Maybe the military has messed me up, but I really prefer seeing something like 18:22 over 6:22pm. It just seems simpler.
I'd strongly consider voting for a US presidential candidate solely on whether moving to metric was a big part of their platform. So much time is wasted in school on the confusing mess that is the imperial system.
I'd prefer to skip metric and redefine the foot as the distance light travels in a nanosecond (~11.8 inches). "The length of the path travelled by light in vacuum during a time interval of 1/299792458 of a second" seems a bit arbitrary to me.
I would also vote for any candidate that would ban anything but powers of 2 in the definition of computer storage.
It would be nice if graphics cards / VM BIOSes extended text video modes (perhaps as new VESA modes) to support Unicode, preferably UTF-32 to make buffer offsets easy to compute. Sure, it means more fonts, and not every code point can squeeze into a few pixels, but most additional code points will look good enough, and better than seeing garbage on the screen.
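With fixed-width UTF-32 cells the offset math really is trivial, which is the appeal (a toy illustration, assuming an 80-column buffer):

```python
def cell_offset(row, col, cols=80, bytes_per_cell=4):
    """Byte offset of the (row, col) cell in a flat UTF-32 text-mode buffer."""
    return (row * cols + col) * bytes_per_cell

assert cell_offset(0, 0) == 0
assert cell_offset(1, 0) == 320   # one full 80-column row of 4-byte cells
assert cell_offset(2, 5) == 660
```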
Metric's nice, but each day being 10 hours in a 10-day week just doesn't work for me. But at least it's Fiveday, and I have a 20-hour weekend to look forward to.
https://en.wikipedia.org/wiki/Hanke-Henry_Permanent_Calendar

My pick of the numerous proposals. Mostly because I can understand it.

https://en.wikipedia.org/wiki/Calendar_reform