Unitools – A suite of tools for working with Unicode in the browser

55 points | causality | 10 years ago | unicod.es

56 comments

[+] LeifCarrotson|10 years ago|reply
These tools are exactly why ASCII is not dead.

Unicode is ideal for storing text exactly as the user wanted it. Which may be crazy.

But when I am writing a program for internal use, or creating a communication protocol, or trying to parse one program's output into another, or trying to make a 7-segment display show some characters, I don't want to have to handle these crazy possibilities. Just text is fine, thank you!

[+] masklinn|10 years ago|reply
> Just text is fine, thank you!

Unicode is text, that's the whole point. "ᐃᓄᐃᑦ ᐅᖃᐅᓯᓕᕆᔨᐅᑉ. ᐃᓄᑦᑎᑑᖅᐳᑦ ᓄᓇᕗᒻᒥ, ᓄᓇᕕᒻᒥ, ᐊᑯᑭᑦᑐᒻᒥ, ᓄᓇᑦᓯᐊᕗᒻᒥ, ᓄᓇᑦᓯᐊᕐᒥ' ᐊᒻᒪᓗ ᐊᓛᓯᑲᒥ" is text, "한글 또는 조선글은 1443년 조선 제4대 임금 세종이 훈민정음(訓民正音)이라는 이름으로 창제하여 1446년에 반포한 문자로, 한국어를 표기하기 위해 만들어졌다.[1][2] 이후 한문을 고수하는 사대부들에게는 경시되기도 하였으나, 조선 왕실과 일부 양반층과 서민층을 중심으로 이어지다가 1894년 갑오개혁에서 한국의 공식적인 나라 글자가 되었고, 1910년대에 이르러 한글학자인 주시경이 '한글'이라는 이름을 사용하였다. 갈래는 표음 문자 가운데 음소 문자에 속한다. 한국에서는 한글전용법이 시행되고 있다" is text, "દેવનાગરી એક લિપિ છે. સંસ્કૃત દેવનાગરી લિપિમાં લખાતી આવી છે. દેવનાગરી લિપિ મૂળ તો સંસ્કૃત માટે જ બની છે, એટલે એમાં દરેક ચિન્હ માટે એક અને માત્ર એક જ ધ્વનિ છે. દેવનાગરીમાં ૧૨ સ્વર અને ૩૪ વ્યંજન છે" is text, and "Алфавиты на основе кириллицы являются или являлись системой письменности для 108 естественных языков, включая следующие славянские языки" is text.

It may be bothersome text I don't understand and you can't be arsed to handle, but it's text nonetheless.

[+] gjm11|10 years ago|reply
For internal use, OK, fine. But what part of the following is not "just text"? "Don’t be naïve. £100 is a lot more than €100—and Hélène knows you know that."
[+] teach|10 years ago|reply
I'd argue that there's no meaningful difference between ASCII text and UTF-8 that just happens to not use any characters higher than '~'.

As long as there's no BOM, anyway.
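
A quick way to check that claim, as a minimal Python sketch: the sample string and the helper name `is_plain_ascii` are made up for illustration, and `utf-8-sig` is Python's codec name for "UTF-8 with a BOM".

```python
import codecs

text = "Just text is fine, thank you!"  # every character is <= '~'

# For pure-ASCII input, the ASCII and UTF-8 encodings are byte-for-byte identical.
ascii_bytes = text.encode("ascii")
utf8_bytes = text.encode("utf-8")
same_bytes = ascii_bytes == utf8_bytes

# A BOM breaks the equivalence: "utf-8-sig" prepends EF BB BF when encoding.
with_bom = text.encode("utf-8-sig")
has_bom = with_bom.startswith(codecs.BOM_UTF8)


def is_plain_ascii(data: bytes) -> bool:
    """True if every byte is below 0x80, i.e. valid as both ASCII and UTF-8."""
    return all(b < 0x80 for b in data)
```

A consumer that only understands ASCII would accept `utf8_bytes` unchanged, but would trip over the three BOM bytes at the start of `with_bom`.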

[+] vmorgulis|10 years ago|reply
> Just text is fine, thank you!

And Unicode has a non-negligible footprint.

I tried a few times to handle that in C or C++ and everything becomes complicated (even with a library like ICU).

Each time, I had to ask myself:
- What's the encoding?
- How to detect it?
- What's the collation?
- ...

I read that Java strings are switching to ASCII (or 16b) because it's too inefficient.
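
To put rough numbers on the footprint, here is a small Python sketch. The sample strings and the `footprint` helper are chosen for illustration only, and it measures encoded size, not how any particular runtime stores strings internally.

```python
samples = {
    "ascii": "text",
    "latin": "na\u00efve",   # "naïve" with a precomposed U+00EF
    "hangul": "한글",
    "cyrillic": "кириллица",
}


def footprint(s: str) -> dict:
    """Count code points and encoded bytes for one string."""
    return {
        "code_points": len(s),
        "utf8_bytes": len(s.encode("utf-8")),
        # The "-le" variant avoids counting a 2-byte BOM.
        "utf16_bytes": len(s.encode("utf-16-le")),
    }


sizes = {name: footprint(s) for name, s in samples.items()}
```

In this sample, ASCII-only text costs one byte per character in UTF-8 and two in UTF-16, while Hangul is the other way around: three bytes per syllable in UTF-8 versus two in UTF-16.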

[+] david-given|10 years ago|reply
Getting Unicode wrong can kill.

http://gizmodo.com/382026/a-cellphones-missing-dot-kills-two...

> The use of "i" resulted in an SMS with a completely twisted meaning: instead of writing the word "sıkısınca" it looked like he wrote "sikisince." Ramazan wanted to write "You change the topic every time you run out of arguments" (sounds familiar enough) but what Emine read was, "You change the topic every time they are fucking you"...

[+] peterkelly|10 years ago|reply
From the article, it sounds as if that wasn't the only factor...
[+] coldtea|10 years ago|reply
Besides the obvious batshit craziness of those involved, that's not unique to unicode.

After all, the Turkish encoding wasn't ASCII in the DOS/Unix years but ISO 8859-9 (and various others, e.g. an IBM one).
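
As a rough illustration in Python (the variable names are arbitrary, and the word is the one quoted from the article above): the dotless ı has a slot in ISO 8859-9 but none in ASCII, and locale-independent case mapping does not keep the two i's apart either.

```python
dotless = "s\u0131k\u0131s\u0131nca"  # "sıkısınca", using U+0131 (dotless i)

# ISO 8859-9 (Latin-5) can represent the dotless i as a single byte.
latin5 = dotless.encode("iso8859_9")

# ASCII cannot represent it at all.
try:
    dotless.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# Python's default case mapping ignores locale: both i's uppercase to "I".
upper_dotless = "\u0131".upper()
upper_dotted = "i".upper()

# Capital dotted I (U+0130) lowercases to "i" plus a combining dot above.
lower_capital_dotted = "\u0130".lower()
```

So any system that has to squeeze this text into ASCII must drop or substitute the character, whichever encoding the message started in.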

[+] teach|10 years ago|reply
Despite the tongue-in-cheek tagline on the site, these are some really neat tools.
[+] kdkooo|10 years ago|reply
I agree! It's always disappointing when an invitation to contest a controversial statement outweighs the information they were actually trying to relay. Would love to see more comments and discussion on clever applications of some of these tools.
[+] coldcode|10 years ago|reply
Heck even EBCDIC isn't dead. Nothing dies forever as much as you'd like it to.
[+] mhuffman|10 years ago|reply
Embedded systems would like to talk with you ...
[+] vmorgulis|10 years ago|reply
Doesn't translate "é" as "&eacute;" (it gives a numeric character reference instead).
[+] ygra|10 years ago|reply
It generally changes characters to numeric character references, not HTML character entities. The named entities are, incidentally, a rather limited set of characters and pretty much useless these days: they were useful in the mid-90s for including a few non-ASCII Latin characters from Latin-1 in HTML pages, when browser charset detection would mess things up if they were included directly. By now you just use Unicode and forget that a named set of entities ever existed.
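
The difference between the two forms can be sketched with Python's standard library. This is not how the site itself does it, just an illustration; the helper names `to_numeric_refs` and `to_named_entity` are made up.

```python
import html
from html.entities import codepoint2name


def to_numeric_refs(s: str) -> str:
    """Replace every non-ASCII character with a decimal numeric character reference."""
    return s.encode("ascii", "xmlcharrefreplace").decode("ascii")


def to_named_entity(ch: str) -> str:
    """Return the HTML 4 named entity for ch, or a numeric reference if none exists."""
    name = codepoint2name.get(ord(ch))
    return f"&{name};" if name else f"&#{ord(ch)};"


numeric = to_numeric_refs("H\u00e9l\u00e8ne")  # "Hélène"
named = to_named_entity("\u00e9")              # "é"
no_name = to_named_entity("\ud55c")            # "한" has no HTML 4 named entity

# Both forms decode back to the same character.
round_trip = html.unescape(named) == html.unescape("&#233;") == "\u00e9"
```

Numeric references work for any code point, whereas the named table only covers a few hundred characters, which is the limitation described above.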