Unitools – A suite of tools for working with Unicode in the browser

[+] LeifCarrotson|10 years ago|reply

These tools are exactly why ASCII is not dead.

Unicode is ideal for storing text exactly as the user wanted it. Which may be crazy.

But when I am writing a program for internal use, or creating a communication protocol, or trying to parse one program's output into another, or trying to make a 7-segment display show some characters, I don't want to have to handle these crazy possibilities. Just text is fine, thank you!

[+] masklinn|10 years ago|reply

> Just text is fine, thank you!

Unicode is text, that's the whole point. "ᐃᓄᐃᑦ ᐅᖃᐅᓯᓕᕆᔨᐅᑉ. ᐃᓄᑦᑎᑑᖅᐳᑦ ᓄᓇᕗᒻᒥ, ᓄᓇᕕᒻᒥ, ᐊᑯᑭᑦᑐᒻᒥ, ᓄᓇᑦᓯᐊᕗᒻᒥ, ᓄᓇᑦᓯᐊᕐᒥ' ᐊᒻᒪᓗ ᐊᓛᓯᑲᒥ" is text, "한글 또는 조선글은 1443년 조선 제4대 임금 세종이 훈민정음(訓民正音)이라는 이름으로 창제하여 1446년에 반포한 문자로, 한국어를 표기하기 위해 만들어졌다.[1][2] 이후 한문을 고수하는 사대부들에게는 경시되기도 하였으나, 조선 왕실과 일부 양반층과 서민층을 중심으로 이어지다가 1894년 갑오개혁에서 한국의 공식적인 나라 글자가 되었고, 1910년대에 이르러 한글학자인 주시경이 '한글'이라는 이름을 사용하였다. 갈래는 표음 문자 가운데 음소 문자에 속한다. 한국에서는 한글전용법이 시행되고 있다" is text, "દેવનાગરી એક લિપિ છે. સંસ્કૃત દેવનાગરી લિપિમાં લખાતી આવી છે. દેવનાગરી લિપિ મૂળ તો સંસ્કૃત માટે જ બની છે, એટલે એમાં દરેક ચિન્હ માટે એક અને માત્ર એક જ ધ્વનિ છે. દેવનાગરીમાં ૧૨ સ્વર અને ૩૪ વ્યંજન છે" is text, and "Алфавиты на основе кириллицы являются или являлись системой письменности для 108 естественных языков, включая следующие славянские языки" is text.

It may be bothersome text I don't understand and you can't be arsed to handle, but it's text nonetheless.

[+] gjm11|10 years ago|reply

For internal use, OK, fine. But what part of the following is not "just text"? "Don’t be naïve. £100 is a lot more than €100—and Hélène knows you know that."

[+] teach|10 years ago|reply

I'd argue that there's no meaningful difference between ASCII text and UTF-8 that just happens to not use any characters higher than '~'.

As long as there's no BOM, anyway.
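
A minimal Python sketch of that point, using only the standard library: ASCII-only text produces identical bytes under both encodings, and a leading BOM is what breaks the equivalence. The sample string is just an illustration.

```python
import codecs

# Illustrative sample: every character is at or below '~' (U+007E).
text = "Just text is fine, thank you!"

ascii_bytes = text.encode("ascii")
utf8_bytes = text.encode("utf-8")

# For ASCII-only text the two encodings are byte-for-byte identical.
same_bytes = ascii_bytes == utf8_bytes

# A UTF-8 byte order mark (EF BB BF) in front breaks strict ASCII decoding.
with_bom = codecs.BOM_UTF8 + utf8_bytes
try:
    with_bom.decode("ascii")
    bom_is_ascii = True
except UnicodeDecodeError:
    bom_is_ascii = False

# The "utf-8-sig" codec strips the BOM when one is present.
stripped = with_bom.decode("utf-8-sig")
```

A consumer that expects plain ASCII will choke on the first three bytes of `with_bom`, even though the rest of the payload is unchanged.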

[+] vmorgulis|10 years ago|reply

> Just text is fine, thank you!

And Unicode has a non-negligible footprint.

I tried a few times to handle that in C or C++, and everything becomes complicated (even with a library like ICU).

Each time, I had to ask myself:
- What's the encoding?
- How to detect it?
- What's the collation?
- ...

I read that Java strings are switching to ASCII (or 16-bit) because it's too inefficient.
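
To put rough numbers on the storage side of this, here is a small Python sketch that compares encoded sizes. The sample strings and the `encoded_sizes` helper are illustrative assumptions, and the sketch says nothing about encoding detection or collation.

```python
# Illustrative samples, written with escapes so the code points are unambiguous.
samples = {
    "ascii": "Just text is fine",
    "latin": "H\u00e9l\u00e8ne",   # Hélène, precomposed accents
    "korean": "\ud55c\uae00",       # 한글, two precomposed syllables
}


def encoded_sizes(s):
    """Return the number of bytes s occupies in a few Unicode encodings."""
    return {
        "utf-8": len(s.encode("utf-8")),
        # The explicit -le variants avoid counting a byte order mark.
        "utf-16": len(s.encode("utf-16-le")),
        "utf-32": len(s.encode("utf-32-le")),
    }


sizes = {name: encoded_sizes(s) for name, s in samples.items()}

for name, by_encoding in sizes.items():
    print(name, by_encoding)
```

ASCII-only text costs one byte per character in UTF-8 and two in UTF-16, which is the overhead a fixed 16-bit string representation pays for mostly-ASCII data.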

[+] peterkelly|10 years ago|reply

An article I always recommend on this topic:

"The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)" by Joel Spolsky

http://www.joelonsoftware.com/articles/Unicode.html

[+] david-given|10 years ago|reply

Getting Unicode wrong can kill.

http://gizmodo.com/382026/a-cellphones-missing-dot-kills-two...

> The use of "i" resulted in an SMS with a completely twisted meaning: instead of writing the word "sıkısınca" it looked like he wrote "sikisince." Ramazan wanted to write "You change the topic every time you run out of arguments" (sounds familiar enough) but what Emine read was, "You change the topic every time they are fucking you"...

[+] peterkelly|10 years ago|reply

From the article, it sounds as if that wasn't the only factor...

[+] coldtea|10 years ago|reply

Besides the obvious batshit craziness of those involved, that's not unique to Unicode.

After all, the Turkish encoding wasn't ASCII in the DOS/Unix years but ISO 8859-9 (and various others, e.g. an IBM one).
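
A hedged Python sketch of the encoding side of this, using the spelling quoted above: the dotless "ı" does not fit in ASCII, but it does fit in ISO 8859-9 and in UTF-8. The `can_encode` helper and the "replace with plain i" fallback are assumptions for illustration, not a description of what the phone in the story actually did.

```python
import unicodedata

# Spelling as quoted in the thread; \u0131 is the dotless i.
word = "s\u0131k\u0131s\u0131nca"


def can_encode(s, encoding):
    """Return True if every character of s is representable in encoding."""
    try:
        s.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


in_ascii = can_encode(word, "ascii")
in_latin5 = can_encode(word, "iso8859_9")
in_utf8 = can_encode(word, "utf-8")

# Hypothetical lossy fallback: swap the dotless i for a plain dotted one.
degraded = word.replace("\u0131", "i")

print(unicodedata.name("\u0131"), in_ascii, in_latin5, in_utf8, degraded)
```

Any system that silently applies a fallback like `degraded` changes the word, whichever character set it started from.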

[+] teach|10 years ago|reply

Despite the tongue-in-cheek tagline on the site, these are some really neat tools.

[+] kdkooo|10 years ago|reply

I agree! It's always disappointing when an invitation to contest a controversial statement outweighs the information they were actually trying to relay. Would love to see more comments and discussion on clever applications of some of these tools.

[+] coldcode|10 years ago|reply

Heck even EBCDIC isn't dead. Nothing dies forever as much as you'd like it to.

[+] mhuffman|10 years ago|reply

Embedded systems would like to talk with you ...

[+] vmorgulis|10 years ago|reply

Doesn't translate "é" as "&eacute;" (gives a numeric character reference instead).

[+] ygra|10 years ago|reply

It generally changes characters to numeric character references, not HTML character entities. The named entities, incidentally, cover a rather limited set of characters and are pretty much useless these days. They were useful in the mid-90s for including a few non-ASCII Latin-1 characters in HTML pages, when browser charset detection would mess things up if you included them directly; by now you just use Unicode and forget that a named set of entities ever existed.
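
For anyone unsure of the difference, a short Python sketch using the standard `html` module. It shows the two kinds of reference for "é" and makes no claim about how the site itself produces its output.

```python
import html
from html.entities import codepoint2name

ch = "\u00e9"  # é

# Numeric character references: decimal and hexadecimal forms.
decimal_ref = ch.encode("ascii", "xmlcharrefreplace").decode("ascii")
hex_ref = "&#x{:X};".format(ord(ch))

# Named entity, looked up in the HTML 4 entity table shipped with Python.
named_ref = "&{};".format(codepoint2name[ord(ch)])

# All three forms decode back to the same character.
decoded = [html.unescape(ref) for ref in (decimal_ref, hex_ref, named_ref)]

print(decimal_ref, hex_ref, named_ref, decoded)
```

Numeric references work for any code point, while a named entity only exists if the character happens to be in the table.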

[+] alexfisher|10 years ago|reply

ASCII 4 Life!

56 comments