
Xterm(1) now UTF-8 by default on OpenBSD

149 points | protomyth | 10 years ago | undeadly.org

137 comments

[+] jhallenworld|10 years ago|reply
I and others have pushed changes into XTerm to improve mouse support for terminal-based applications. All terminal emulators should implement XTerm's command set, especially these:

Bracketed paste mode: allows the editor to determine that text came from a mouse paste rather than being typed. This way, the editor can disable auto-indent and other things which can mess up the paste. Libvte now supports this!
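The mechanism is a pair of escape sequences: the application opts in, and the terminal then wraps any pasted text in begin/end markers. A minimal Python sketch (the escape sequences are xterm's documented ones; the `extract_paste` helper is just for illustration):

```python
# Bracketed paste, as implemented by xterm and libvte. The terminal
# wraps pasted text in marker sequences so the application can tell
# a paste apart from typed input.
ENABLE      = "\x1b[?2004h"  # application asks for bracketing
DISABLE     = "\x1b[?2004l"  # ...and turns it off again on exit
PASTE_BEGIN = "\x1b[200~"    # terminal sends this before a paste
PASTE_END   = "\x1b[201~"    # ...and this after it

def extract_paste(data):
    """Return pasted text if `data` contains a bracketed paste, else None."""
    start = data.find(PASTE_BEGIN)
    if start == -1:
        return None
    start += len(PASTE_BEGIN)
    end = data.find(PASTE_END, start)
    return data[start:] if end == -1 else data[start:end]

# An editor receiving this chunk knows to insert it verbatim,
# with auto-indent disabled.
print(repr(extract_paste("\x1b[200~def f():\n    pass\x1b[201~")))
```

An editor enables the mode at startup, restores it on exit, and treats everything between the markers as literal input.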

Base64 selection transfer: this is a further enhancement which allows the editor to query or submit selection text to the X server. This allows editors to fully control the selection process, for example to allow the selection to extend through the edit buffer instead of just the terminal emulator's contents.
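The base64 transfer is xterm's OSC 52 escape (many emulators disable it by default for security; in xterm it's gated behind the allowWindowOps resource, if I recall). A sketch of building the write-to-clipboard form; the helper name is mine:

```python
import base64

def osc52_set_clipboard(text):
    """Build the OSC 52 sequence asking the terminal to set the
    clipboard selection ('c') to the base64-encoded text."""
    payload = base64.b64encode(text.encode("utf-8")).decode("ascii")
    return "\x1b]52;c;" + payload + "\x07"  # BEL-terminated OSC

seq = osc52_set_clipboard("hello from the editor")
# Writing `seq` to the tty would hand the text to the terminal's
# selection; sending '?' in place of the payload queries it back.
print(repr(seq))
```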

One patch of mine didn't take, but I think it's still needed: allow mouse drag events to be reported even if the coordinates extend beyond the xterm window frame. Along with this is the ability to report negative coordinates if the mouse is above or to the left of the window. Why would this be needed? Think of selecting text which is scrolled off the window. The distance between edge and the mouse controls the rate of selection scrolling in that direction.

BTW, it's fun to peruse xterm's change log. For example, you can see all the bugs and enhancements from Bram Moolenaar for VIM. http://invisible-island.net/xterm/xterm.log.html

Thomas Dickey maintains a lot of other software as well, in particular ncurses, vile and lynx: http://invisible-island.net/

[+] caf|10 years ago|reply
Bracketed paste mode is also useful for IRC, to prevent misfiring a huge paste into a channel.
[+] sherr|10 years ago|reply
Yes, Thomas Dickey's been maintaining xterm and a lot else for donkey's years now. A lot of people owe him a big "thank you" for all his hard work. Thanks Thomas.
[+] spedru|10 years ago|reply
Every time some link or headline reads "now UTF-8 by default", the only reasonable response in 2016 is "about time".
[+] JoachimSchipper|10 years ago|reply
That's not why this article is interesting. Rather, it highlights how profoundly not UTF-8 ready the (terminal) world is.

(It does work in practice, but in-band signaling over a channel carrying complex data that receiver and sender interpret according to settings that do not appear in the protocol at all is, predictably, terrible.)

[+] thisrod|10 years ago|reply
This reminded me of a Rob Pike comment. I can't find the text, but it was along the lines of, "I recently tried Linux. It was as if every bug I fixed in the 1980s had reverted."
[+] kazinator|10 years ago|reply
That was baseless posturing. A famous study and its follow-up found that the utilities on GNU/Linux are more robust, and that was twenty years ago:

ftp://ftp.cs.wisc.edu/paradyn/technical_papers/fuzz-revisited.pdf [1995]

"This study parallels our 1990 study (that tested only the basic UNIX utilities); all systems that we compared between 1990 and 1995 noticeably improved in reliability, but still had significant rates of failure. The reliability of the basic utilities from GNU and Linux were noticeably better than those of the commercial systems."

I doubt there has been much improvement in those commercial Unixes; they are basically dead. (What would be the business case for fixing something in a userland utility on a commercial Unix?)

The maintainers of the free BSDs have been carrying that torch, but they don't believe in features.

Stepping into a BSD variant is like a trip back to the 1980s. Not exactly the real 1980s, but a parallel 1980s in which Unix is more robust; the features, though, are all rolled back, so it's just about as unpleasant to use.

[+] tempodox|10 years ago|reply
Last I checked, a call to fgetwc(3) on Linux crashes as soon as I actually enter a non-ASCII character, with a locale of en_US.UTF-8.
[+] igravious|10 years ago|reply
I've been trying to teach myself some Unicode code points because I'm getting sick and tired of continually Googling them and copying and pasting the result, or bringing up a symbol character table.

In fact, I'd say keyboards are woefully out of date.

Specifically, I keep looking up † dagger (U+2020) and ‡ double-dagger (U+2021) for footnotes, black heart (U+2665) to be romantic, black star (U+2605) to talk about David Bowie's last album and ∞ (U+221E) to talk about actual non-finite entities.

I only found out recently that Ctrl+Shift+u followed by the Unicode hex digits outputs these in Ubuntu, presumably all Linuxen. AltGr+8 is great for diaeresis while we're at it, so you can go all hëävÿ mëtäl really easily.

edit: black heart and star are not making it through, why Lord, why?!
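For one-off lookups without a search engine, Python's standard unicodedata module can resolve characters by name; a small sketch (the name list mirrors the characters above):

```python
import unicodedata

# Look characters up by Unicode name instead of memorising code
# points. (The heart is BLACK HEART SUIT, U+2665.)
for name in ("DAGGER", "DOUBLE DAGGER", "BLACK HEART SUIT",
             "BLACK STAR", "INFINITY"):
    ch = unicodedata.lookup(name)
    print("U+%04X  %s  %s" % (ord(ch), ch, name))
```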

[+] kbd|10 years ago|reply
I have a stupid little 'clip' program I wrote that has a dictionary of common texts that I can call by name and have added to the clipboard.

    $ clip lod
    $ pbpaste
    ಠ_ಠ
Maybe you can do the same without needing to remember code points. Something like TextExpander would accomplish the same thing.
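A tool like that is only a few lines; a minimal sketch assuming macOS's pbcopy (the `clip`/`lod` names just mirror the comment above, and on X11 you'd pipe to `xclip -selection clipboard` instead):

```python
import subprocess
import sys

# Named snippets; extend with whatever you paste often.
SNIPPETS = {
    "lod": "\u0ca0_\u0ca0",   # look of disapproval
    "dagger": "\u2020",
    "shrug": "\u00af\\_(\u30c4)_/\u00af",
}

def clip(name):
    """Put the named snippet on the clipboard via pbcopy."""
    text = SNIPPETS[name]
    subprocess.run(["pbcopy"], input=text.encode("utf-8"), check=True)

if __name__ == "__main__" and len(sys.argv) > 1:
    clip(sys.argv[1])
```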
[+] elros|10 years ago|reply
On OS X, if you type Command+Control+Space, it brings up a character insertion menu where you can search by character name. I can get both daggers, black star and black heart quite quickly that way.
[+] dylan-m|10 years ago|reply
Another really handy thing is the Compose key. If you're using GNOME it's under Keyboard Settings, under Shortcuts / Typing. I have it set to Right Alt. The idea is there's just a whole bunch of memorable key sequences for various common Unicode characters. For example, Alt + o + o = °; < + 3 = black heart, < + " = “, etc. It doesn't have all of the ones you like, but it's helpful :)
[+] jcranmer|10 years ago|reply
That Ctrl+Shift+u hint is nice. Now I can type it any time I want without having to browse to the emoji page to copy it.

And it sucks that I use it so much that I know the code point for it (1F4A9) off the top of my head. :-(

Edit: I'm definitely putting in U+1F4A9 (the PILE OF POO character), but apparently hacker news strips it out. I'm guessing it's filtering everything that has a symbol character class?

[+] scrupulusalbion|10 years ago|reply
A few months ago, I had the idea to remake the old Space Cadet keyboard. One change was to make the bucky bits (e.g. control, alt, meta, super, etc.) allow you to type unicode characters instead of APL characters. Other than that and having lower case parentheses (not needing to use shift to type ( or ) ), the keyboard would be like any other mechanical keyboard.
[+] JdeBP|10 years ago|reply
> In fact, I'd say keyboards are woefully out of date.

I wrote a virtual terminal subsystem a while ago. I gave it keyboard layouts with the ISO 9995-3 common secondary group. No daggers, alas. But ISO 9995-3 does have pretty much all of the combining diacritical marks. <Group2> <Level3>+D05 is combining diaeresis. In practice I find myself not appreciating that as much as I appreciate being able to type U+00A7 as <Group2> <Level2>+C02.

[+] unfamiliar|10 years ago|reply
I have an Alfred workflow that fuzzy-searches through all unicode characters by name and inserts the character when selected. All it takes is a good interface to make it fluid.
[+] Grue3|10 years ago|reply
Ctrl-Shift-u also works in GIMP, even on Windows. I guess it's a GTK feature.
[+] gpvos|10 years ago|reply
Wouldn't it be better if all those dangerous escape sequences (like Application Program Command, redefining function keys, alternate character sets, etc.) were disabled by default in xterm? Anyone using the obsolete software that uses them could enable them if they wish.
[+] deathanatos|10 years ago|reply
Repeat after me: UTF-8 is the sane default in this day and age. This is a good change.

The whole ISO 6429 C1 "application program command" thing is a bit surprising though. (I'm guessing this change doesn't actually avoid it directly? If you sent an APC it'd still be processed; it's just that APC is multiple bytes in UTF-8, and hopefully a bit rarer?)
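Assuming that reading is right, the encoding detail is easy to see from Python: the raw C1 APC byte (0x9F) is not valid UTF-8 on its own, so in a UTF-8 terminal the control has to arrive either as the two-byte 7-bit form ESC _ or as the two-byte UTF-8 encoding of U+009F:

```python
# The C1 control APC is code point U+009F. As a raw single byte it
# is valid Latin-1, but a lone 0x9F is an illegal byte in UTF-8,
# so it cannot slip through by accident.
raw = b"\x9f"

assert raw.decode("latin-1") == "\u009f"
try:
    raw.decode("utf-8")
except UnicodeDecodeError as e:
    print("lone 0x9F rejected:", e.reason)

# In UTF-8 the same code point takes two bytes; the 7-bit form
# ESC _ (0x1B 0x5F) is a third spelling of the same control.
print("\u009f".encode("utf-8"))  # b'\xc2\x9f'
```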

> Reinterpreting US-ASCII in an arbitrary encoding

This will likely work, or so I thought. The vast majority of encodings are a superset of ASCII, so reinterpreting ASCII text in them is valid. The only one I know of that isn't is EBCDIC, and I've never seen it used. (Said differently, non-superset-of-ASCII codecs are incredibly rare to encounter, so the above assumption usually holds.) (The reverse, reinterpreting arbitrary data as ASCII, is not going to work out as well.)
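That superset claim is easy to check mechanically; a quick Python sketch (the codec list is just illustrative):

```python
ascii_bytes = bytes(range(128))  # every ASCII code point, 0x00-0x7F

# Most encodings in common use map bytes 0-127 exactly as ASCII
# does, so reinterpreting ASCII text in them is harmless.
for codec in ("utf-8", "latin-1", "cp1252", "koi8-r", "shift_jis"):
    assert ascii_bytes.decode(codec) == ascii_bytes.decode("ascii"), codec

# EBCDIC is the exception: the same bytes mean different characters.
print(b"hello".decode("cp500"))  # cp500 is an EBCDIC code page
```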

Though it is rather horrifying how easy it is to dump arbitrary data into a terminal's stream. Unix does not make this easy for the program. The vast majority of programs, I'd say, really just want to output text. Yet, they're connected to a terminal. Better still, a program could say "I'm outputting arbitrary binary data", or even "I'm outputting application/tar+gzip"; the terminal would then know immediately not to interpret this input. And in the case of tar+gzip, it would have the opportunity to do something truly magical: it could visualize the octets (since trying to interpret a gzip as UTF-8 is insane); it could even just note that the output was a tar, and list the tar's contents like tar -t. If the program declares itself aware, say "application/terminal.ansi", then okay, you know: it's aware; interpret away.

But it doesn't, so it can't. Part of the difficulty is probably that the TTY is both input and output (not that the input can't also declare a mimetype or something similar). And the vast majority of programs don't escape their user input before sending it to a terminal; it's like one giant "terminal-XSS" or "SQL-injection-for-your-terminal". And it is probably unreasonable to expect it; I don't really know of any good libraries around terminal I/O; most programs I see that do it assume the world is an xterm and just encode the raw bytes, right there, and pray w.r.t. user input.

Catting the Linux kernel's gzipped image into tmux can have consequences ranging from "lol" to "I guess we need a new tmux session".

It was also just today that I discovered that neither GNU's `ps` nor `screen` supports Unicode, at least not for characters outside the BMP.

[+] comex|10 years ago|reply
UTF-16 isn't a superset of ASCII, for one. Doesn't seem that anyone uses a native UTF-16 terminal, but if you're trying to use grep or whatnot on a UTF-16 encoded file, it'll happily silently not do what you want...
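A quick sketch of that failure mode, assuming a byte-oriented search like grep's:

```python
text = "error: disk full"

utf8  = text.encode("utf-8")
utf16 = text.encode("utf-16-le")

# grep searches for the pattern's raw bytes. Those bytes appear
# verbatim in UTF-8, but in UTF-16 every ASCII character is
# followed by a NUL byte, so the contiguous pattern never matches.
print(b"error" in utf8)   # True
print(b"error" in utf16)  # False
```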
[+] zkirill|10 years ago|reply
This is really great! Just a few days ago I got very confused when I saw tofu characters in xterm and had to switch to uxterm to see them (or set some locale flag in my home dir).
[+] plugnburn|10 years ago|reply
UTF-8 must be the default and only encoding. Why does anything else still exist?
[+] Grue3|10 years ago|reply
Because you don't want to give the geniuses who came up with stuff like "Han Unification" a monopoly on encoding.
[+] jmnicolas|10 years ago|reply
Yes but UTF-8 with or without byte order mark ? ;-)
[+] Tiksi|10 years ago|reply
ANSI must be the default and only encoding. Why does anything else still exist?
[+] kazinator|10 years ago|reply
Great! Now just drop the embarrassing man(1) page reference, and you can call it modernized.

Wow, I'm surprised that the people whose buttons this pushes are able to make(1) a HN account, let alone have enough points to downvote.

Think about it. There is only one man page for xterm. If you type "man xterm" with no section number you get that man page. If there existed an xterm(7) page, you'd still get the xterm(1) man page by default. So why the hell write the (1) notation every time you type the word xterm?

Man page section numbers are not useful or relevant, by and large, and mentioning them only adds noise to a paragraph.

Even stupider is when the worst of the Unix wankers write man page section numbers after ISO C function names. Example sentence: "Microsoft's malloc(3) implementation is found in MSVCRT.DLL". #facepalm#

[+] gjvc|10 years ago|reply
>Think about it. There is only one man page for xterm. If you type "man xterm" with no section number you get that man page. If there existed an xterm(7) page, you'd still get the xterm(1) man page by default. So why the hell write the (1) notation every time you type the word xterm?

Because the convention exists to identify the type of the component. It's a handy convention, and I'm betting there are a few people reading this who have never used anything other than GNOME Terminal, so appending the section number immediately helps the reader place the component; otherwise they'd have to look it up.

[+] neerdowell|10 years ago|reply
OpenBSD's malloc(3) implementation is found in lib/libc/stdlib/malloc.c, and OpenBSD's malloc(9) implementation is found in sys/kern/kern_malloc.c
[+] recursive|10 years ago|reply
Huh. I always thought those parenthesized numbers after unix commands were version numbers.
[+] klodolph|10 years ago|reply
Don't take the downvotes personally, it's just uninteresting content getting moderated.