For internal use, OK, fine. But what part of the following is not "just text"? "Don’t be naïve. £100 is a lot more than €100—and Hélène knows you know that."
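For anyone who wants to check, here is a quick Python sketch that lists the characters in that sentence which 7-bit ASCII cannot represent. The function name `non_ascii_chars` is made up for illustration, and the sentence is normalized to NFC first so that accented letters count as single code points.

```python
import unicodedata

def non_ascii_chars(text: str) -> list[str]:
    """Return the distinct non-ASCII characters of text, in order of first appearance."""
    found = []
    for ch in unicodedata.normalize("NFC", text):
        if ord(ch) > 0x7F and ch not in found:
            found.append(ch)
    return found

sentence = "Don’t be naïve. £100 is a lot more than €100—and Hélène knows you know that."

for ch in non_ascii_chars(sentence):
    # Print the code point and its official Unicode name.
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
```

That is seven distinct characters in one ordinary English sentence, none of which would survive an ASCII-only pipeline.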
I agree! It's always disappointing when an invitation to contest a controversial statement outweighs the information the author was actually trying to relay. Would love to see more comments and discussion on clever applications of some of these tools.
It generally changes characters to numeric character references, not HTML character entities. The named entities, incidentally, cover a rather limited set of characters and are pretty much useless these days: they were useful in the mid-90s for including a few non-ASCII characters from Latin-1 in HTML pages, when browser charset detection would mess things up if you included them directly. By now you just use Unicode and forget that a named set of entities ever existed.
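To illustrate the difference, a minimal Python sketch: the standard `xmlcharrefreplace` error handler turns each non-ASCII character into a decimal numeric character reference, and `html.unescape` shows that a named entity and a numeric reference decode to the same character. The helper name `to_numeric_refs` is only for this example and says nothing about how the tool under discussion is implemented.

```python
import html

def to_numeric_refs(text: str) -> str:
    """Replace every non-ASCII character with a decimal numeric character reference."""
    return text.encode("ascii", errors="xmlcharrefreplace").decode("ascii")

print(to_numeric_refs("naïve €100"))  # na&#239;ve &#8364;100

# A named entity and a numeric reference are two spellings of the same character.
print(html.unescape("&euro;") == html.unescape("&#8364;"))  # True
```

Numeric references work for any code point, whereas named entities exist only for the characters someone bothered to name.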
LeifCarrotson|10 years ago
Unicode is ideal for storing text exactly as the user wanted it. Which may be crazy.
But when I am writing a program for internal use, or creating a communication protocol, or trying to parse one program's output into another, or trying to make a 7-segment display show some characters, I don't want to have to handle these crazy possibilities. Just text is fine, thank you!
masklinn|10 years ago
Unicode is text, that's the whole point. "ᐃᓄᐃᑦ ᐅᖃᐅᓯᓕᕆᔨᐅᑉ. ᐃᓄᑦᑎᑑᖅᐳᑦ ᓄᓇᕗᒻᒥ, ᓄᓇᕕᒻᒥ, ᐊᑯᑭᑦᑐᒻᒥ, ᓄᓇᑦᓯᐊᕗᒻᒥ, ᓄᓇᑦᓯᐊᕐᒥ' ᐊᒻᒪᓗ ᐊᓛᓯᑲᒥ" is text, "한글 또는 조선글은 1443년 조선 제4대 임금 세종이 훈민정음(訓民正音)이라는 이름으로 창제하여 1446년에 반포한 문자로, 한국어를 표기하기 위해 만들어졌다.[1][2] 이후 한문을 고수하는 사대부들에게는 경시되기도 하였으나, 조선 왕실과 일부 양반층과 서민층을 중심으로 이어지다가 1894년 갑오개혁에서 한국의 공식적인 나라 글자가 되었고, 1910년대에 이르러 한글학자인 주시경이 '한글'이라는 이름을 사용하였다. 갈래는 표음 문자 가운데 음소 문자에 속한다. 한국에서는 한글전용법이 시행되고 있다" is text, "દેવનાગરી એક લિપિ છે. સંસ્કૃત દેવનાગરી લિપિમાં લખાતી આવી છે. દેવનાગરી લિપિ મૂળ તો સંસ્કૃત માટે જ બની છે, એટલે એમાં દરેક ચિન્હ માટે એક અને માત્ર એક જ ધ્વનિ છે. દેવનાગરીમાં ૧૨ સ્વર અને ૩૪ વ્યંજન છે" is text, and "Алфавиты на основе кириллицы являются или являлись системой письменности для 108 естественных языков, включая следующие славянские языки" is text.
It may be bothersome text I don't understand and you can't be arsed to handle, but it's text nonetheless.
gjm11|10 years ago
teach|10 years ago
As long as there's no BOM, anyway.
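Assuming the BOM meant here is the UTF-8 byte order mark (the bytes EF BB BF), this small Python sketch shows why it gets in the way: a plain UTF-8 decode keeps it as a stray U+FEFF at the start of the text, while the `utf-8-sig` codec drops it. The helper name `decode_utf8` is illustrative.

```python
import codecs

def decode_utf8(data: bytes) -> str:
    """Decode UTF-8 bytes, dropping a leading BOM if one is present."""
    return data.decode("utf-8-sig")

plain = b"hello"
with_bom = codecs.BOM_UTF8 + plain  # b"\xef\xbb\xbfhello"

print(repr(with_bom.decode("utf-8")))  # '\ufeffhello' (BOM survives as U+FEFF)
print(repr(decode_utf8(with_bom)))     # 'hello'
print(repr(decode_utf8(plain)))        # 'hello' (input without a BOM is unchanged)
```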
vmorgulis|10 years ago
And Unicode has a non-negligible footprint.
I tried a few times to handle that in C or C++ and everything became complicated (even with a library like ICU).
Each time, I had to ask myself:
- What's the encoding?
- How to detect it?
- What's the collation?
- ...
I read that Java strings are switching to a one-byte representation (Latin-1) where possible, falling back to 16-bit otherwise, because always using 16 bits per character is too inefficient.
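On the footprint point, a rough Python sketch that compares how many bytes the same text takes in three Unicode encodings. The helper `encoded_sizes` is a made-up name, and this measures the encoded bytes only, not the in-memory layout of any particular runtime's string type.

```python
def encoded_sizes(text: str) -> dict[str, int]:
    """Return the encoded length in bytes of text for a few Unicode encodings."""
    encodings = ("utf-8", "utf-16-le", "utf-32-le")
    return {enc: len(text.encode(enc)) for enc in encodings}

print(encoded_sizes("hello"))  # {'utf-8': 5, 'utf-16-le': 10, 'utf-32-le': 20}
print(encoded_sizes("한글"))    # {'utf-8': 6, 'utf-16-le': 4, 'utf-32-le': 8}
```

ASCII-heavy text doubles in size under a 16-bit representation, which is the kind of overhead a compact one-byte form is meant to avoid.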
peterkelly|10 years ago
"The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)" by Joel Spolsky
http://www.joelonsoftware.com/articles/Unicode.html
david-given|10 years ago
http://gizmodo.com/382026/a-cellphones-missing-dot-kills-two...
> The use of "i" resulted in an SMS with a completely twisted meaning: instead of writing the word "sıkısınca" it looked like he wrote "sikisince." Ramazan wanted to write "You change the topic every time you run out of arguments" (sounds familiar enough) but what Emine read was, "You change the topic every time they are fucking you"...
peterkelly|10 years ago
coldtea|10 years ago
After all, the Turkish encoding in the DOS/Unix years wasn't ASCII but ISO 8859-9 (and various others, e.g. an IBM one).
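A small Python sketch of that point: the dotless ı (U+0131) from the quoted SMS has no ASCII encoding at all, but ISO 8859-9 gives it a single byte. The helper `try_encode` is hypothetical, written only for this illustration.

```python
from typing import Optional

def try_encode(text: str, encoding: str) -> Optional[bytes]:
    """Encode text, returning None when the encoding cannot represent it."""
    try:
        return text.encode(encoding)
    except UnicodeEncodeError:
        return None

word = "s\u0131k\u0131s\u0131nca"  # the word from the quote, with dotless ı

print(try_encode(word, "ascii"))      # None
print(try_encode(word, "iso8859_9"))  # b's\xfdk\xfds\xfdnca'
```

So a system that only speaks ASCII has to either reject the word or substitute a different letter, and the substitution is what changes the meaning.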
teach|10 years ago
kdkooo|10 years ago
coldcode|10 years ago
mhuffman|10 years ago
vmorgulis|10 years ago
ygra|10 years ago
alexfisher|10 years ago