ASCII is by far the most successful character encoding that computers have used. It was invented in 1963, back in the era of punch cards and core memory. Modern RAM did not exist until 1975 -- a decade later.
Unicode is the replacement, not the competitor, like 64-bit IP addresses are the replacement for 32-bit IP addresses. It was developed in the early 1990s when RAM got cheap enough that you could afford two bytes per character.
Personally, I deal with data all the time and rarely encounter Unicode. Of course, I'm in the US dealing with big files out of financial and marketing databases. In fact, I've seen more EBCDIC than Unicode.
"Modern RAM did not exist until 1975 -- a decade later."
What does that even mean? It doesn't mean DIP packaged DRAM because my dad was buying COTS Intel 1103's in 1971 or so before I was even born. And the first "I'm gonna store one bit of data in a capacitor" was done over the pond in the .uk during WWII at their code breaking plant.
"like 64-bit IP addresses are the replacement for 32-bit IP addresses."
I've worked on automated data submissions for banks in the UK, and the insistence on fixed-width, EBCDIC encoded data files for many regulatory filings (FSA, credit rating agencies) was annoying. On the other hand, it was so easy to automate in a VBA macro that I could have quite a bit of free time.
I really hate to nitpick, but the article implies that ASCII was the first character encoding. In fact, there was a rich history of different encodings before that, with different word sizes and/or incompatible 8 bit encodings. It's quite interesting to look back and see what trade-offs were made and why.
Well, it's the oldest character set and encoding that's still semi-relevant today. I doubt many people nowadays encounter EBCDIC and the like (and if they do, the article isn't aimed at them, I guess).
The fact that UTF-8 and UTF-16 are often exposed to programmers when dealing with text is a major failure of separation-of-concerns. If you had a stream of data that was gzipped, would it ever make sense to look at the bytes in the data stream before decompressing it? Variable-length text encodings are the same. Application code should only see Unicode code points.
In general it was a mistake to put variable-length encodings into the Unicode standard. A much better design would have been to use UTF-32 for the application-level interface to characters, and use a separate compression standard that is optimized for fixed alphabets when transporting or storing text. This has the advantage that the compression scheme can be dynamically updated to match the letter frequencies in the real-world text, and it logically separates the ideas of encoding and compression so that the compression container is easier to swap out. And, of course, an entire class of bugs would be eliminated from application code.
Edited first paragraph to clarify: Variable-length text encodings are the same.
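The distinction can be sketched in Python (just an illustration of the boundary, not a prescription): the same string has one length in code points and another in encoded bytes, and application logic should only ever care about the former.

```python
s = "café"

# Application-level view: a sequence of Unicode code points.
print(len(s))                           # 4 code points
print([hex(ord(c)) for c in s])         # ['0x63', '0x61', '0x66', '0xe9']

# Transport-level view: UTF-8 is a variable-length byte encoding.
data = s.encode("utf-8")
print(len(data))                        # 5 bytes; 'é' (U+00E9) takes two
print(s.encode("utf-32-be").hex(" "))   # fixed 4 bytes per code point

# Round-trip: decode at the boundary, then forget the bytes existed.
assert data.decode("utf-8") == s
```

Decoding once at the I/O boundary, the way gzip is decompressed once, is exactly the separation of concerns being argued for here.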
I agree that that is the ideal end situation, but Unicode would have been dead on arrival if they had chosen that approach. Memory was just too expensive at the time to make a system that, in most of the computer-using world, wasted 75% of the space in every text string. And no, just-in-time decompression wouldn't have worked either; CPU cycles were also too expensive at the time for that.
That makes it sound as if UTF-32 would be a silver bullet - it's not. Application code normally has to deal with user-perceived characters/glyphs/grapheme clusters, which means you'll have to treat UTF-32 as variable-length as well.
If you want to get back the supposed benefits of UTF-32, you'll have to dynamically assign codepoints to grapheme clusters.
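A quick Python sketch of why code points still aren't user-perceived characters: an accented letter can be one code point or two, and some grapheme clusters have no single-code-point form at all.

```python
import unicodedata

composed = "\u00e9"      # é as one precomposed code point
decomposed = "e\u0301"   # e + combining acute accent: same glyph, two code points
print(len(composed), len(decomposed))            # 1 2

# Normalisation can merge this particular pair...
assert unicodedata.normalize("NFC", decomposed) == composed

# ...but not every grapheme cluster has a precomposed code point.
flag = "\U0001f1f3\U0001f1f1"   # regional indicators N + L render as one flag
print(len(flag))                                 # 2 code points, 1 perceived character
assert unicodedata.normalize("NFC", flag) == flag  # NFC can't collapse it
```

So even with UTF-32's one-code-unit-per-code-point property, "give me the third character" still needs cluster-aware iteration.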
I'm impressed. Easily readable and understandable, short, and as far as I can tell it contains no factual inaccuracies or wrong information (unlike many other Unicode introductions and tutorials).
You want some criticism, because we have too little of that here on HN? I'll bite. ;)
"A byte is a set of 8 bits. Computers typically move data around a byte at a time."
A byte being 8 bits is ok. Historically, a byte might have a different number of bits, but all modern architectures use 8 bits. Since this is an introductory article, this is fine. (more details: http://en.wikipedia.org/wiki/Byte)
Computers do not typically move data in byte chunks. You could say "a byte is the smallest unit of data a CPU can load or store". If you talk about moving data, the question is between what. Probably memory. However, there are caches nowadays, since bandwidth is cheap and latency is expensive. Data is moved in cache line chunks, which means 4-64 byte chunks depending on architecture and cache level. Bigger chunks in upcoming architectures.
I think this is a great explanation for anyone who's approaching the subject for the first time. It gives a good introduction as to why you're staring at broken data coming from a db and just how royally screwed your afternoon is going to be getting it back into shape :)
It's worse than that, actually, as ASCII from the start¹ included provisions for variants for non-English latin characters and alternate currency symbols, and ASCII was essentially the same project as ECMA-6² (ECMA being the European Computer Manufacturers' Association³, a standardization group founded in 1961).
ASCII as we know it (which is essentially the 1967 version⁴) like the corresponding ECMA standard⁵ provided for overloading punctuation characters as diacritics ("/¨ ^/ˆ ~/˜ '/´ ‘/` ,/¸) to be overstruck in typewriter fashion; ECMA-35⁶ (1971⁷) defines further extension techniques using control and/or escape sequences.
So, yes, it's just a failed attempt at an anti-American cheap shot from someone who isn't familiar with the development of character set encodings.
I'm guessing it depends on the language. In Spanish we have only one type of accent, and only on vowels: áéíóú, and the ñ.
The ñ is right next to the "L" key, and the tilde (accent) is either right next to the ñ or on top of it.
This adds some complexity, but in my experience the average user simply expects the key to be "where it always was", and I've had to "fix the problem" (changing the input method) many, many times.
To answer your question, I think the number of users who actually take note of what happened and understand it well enough to fix it again is minimal; the rest just expect it to work and ask for help when it doesn't.
There are also a lot of variations of Spanish keyboards, which just makes the matter more complicated... I use *nix, Windows and OS X almost interchangeably and know how to change the input language in most of them to ISO Spanish (Spain) quite quickly, but I'm not representative in that regard.
Slightly offtopic, but the Spanish (Spain) keyboard layout is extremely comfortable for programming...
"Designed as a single, global replacement for localised character sets, the Unicode standard is beautiful in its simplicity. In essence: collect all the characters in all the scripts known to humanity and number them in one single, canonical list. If new characters are invented or discovered, no problem, just add them to the list. The list isn’t an 8-bit list, or a 16-bit list, it’s just a list, with no limit on its length."
Is this really true? My impression was that UTF-32 is a fixed-length encoding which uses 32 bits to encode all of Unicode. It seems that this means that Unicode can never have more code points than could fit in 32 bits. Right?
Unicode can never have more code points than fit into 21 bits, because Unicode is a 21-bit code. This has historical reasons: Unicode was initially a 16-bit code, and it soon became apparent that 65,536 characters are not enough, especially with the commitment to
a) compatibility with every pre-existing character set, and
b) including historical scripts too¹
21 bits was what emerged from expanding UCS-2 to UTF-16 via surrogate pairs, and Unicode was reörganised into 17 “planes”, the first of which, the BMP, containing all code points allocated up to that point. UTF-32 then was just a simple encoding scheme that allows one code point per code unit and is also efficient to process. 21- or 24-bit code units would be unwieldy on most architectures (especially regarding unaligned memory access).
___________
¹ Arguably the decision to include Emoji made a bigger dent in the code point space than hieroglyphs, Linear [AB], etc., though, but that came a little later.
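The surrogate-pair arithmetic described above can be sketched in a few lines of Python (`to_surrogates` is just an illustrative helper name); it also shows exactly why the ceiling lands at U+10FFFF.

```python
def to_surrogates(cp):
    """Split a supplementary-plane code point into a UTF-16 surrogate pair."""
    assert 0x10000 <= cp <= 0x10FFFF
    v = cp - 0x10000                       # 20 bits left to distribute
    return 0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)

print([hex(u) for u in to_surrogates(0x1F600)])   # ['0xd83d', '0xde00']
assert "\U0001F600".encode("utf-16-be") == b"\xd8\x3d\xde\x00"

# 10 + 10 payload bits on top of 0x10000 tops out at exactly U+10FFFF:
print([hex(u) for u in to_surrogates(0x10FFFF)])  # ['0xdbff', '0xdfff']
```

Sixteen planes reachable via surrogates plus the BMP gives the 17 planes mentioned above.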
Unicode code points go up to 0x10FFFF, which fits in 21 bits.
If this were ever to become a problem (which I don't see happening any time soon), the transition from UCS-2 to UTF-16 is prior art on how to pull off the extension of a coding space.
Somewhat unrelated, but nevertheless worth mentioning as it's a common misconception: While UTF-32 is a fixed-length coding for Unicode characters, often, the more interesting unit is the grapheme cluster, effectively making UTF-32 into a variable-length coding.
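Python enforces that 0x10FFFF ceiling directly, for what it's worth; a quick check:

```python
import sys

print((0x10FFFF).bit_length())      # 21
print(sys.maxunicode == 0x10FFFF)   # True: CPython's hard ceiling

print(len(chr(0x10FFFF)))           # 1 - the very last code point is valid
try:
    chr(0x110000)                   # one past the end
except ValueError:
    print("0x110000 is out of range")
```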
Think you're right. The article could do with a rewrite, as Unicode used to be 16-bit until 1996 (according to Wikipedia), which explains why Java/Windows are really UTF-16 based.
The more interesting question is: if you're designing a new operating system, would you pick UTF-8 or UTF-32 as the basis of your character system? Bearing in mind that you need to normalise strings anyway for comparison purposes, the general space efficiency of UTF-8 for most systems seems tempting.
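A rough Python sketch of that trade-off (the sample string is arbitrary; the normalisation step is needed whichever encoding you pick):

```python
import unicodedata

mostly_ascii = "if err != nil { return err }"
print(len(mostly_ascii.encode("utf-8")))      # 28 bytes
print(len(mostly_ascii.encode("utf-32-be")))  # 112 bytes: 4x for typical OS strings

# Normalisation is required for comparisons either way:
a, b = "\u00e9", "e\u0301"                    # é, composed vs decomposed
print(a == b)                                 # False
print(unicodedata.normalize("NFC", a) ==
      unicodedata.normalize("NFC", b))        # True
```

Since identifiers, paths, and config files skew heavily ASCII, the 4x blow-up is what makes UTF-32 a hard sell at the OS level.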
I believe it depends on the encoding - UTF-32 has a hard limit of 2^32 code points, but UTF-8's original design ran to 6 bytes (31 bits) and could in principle be extended further (the current 4-byte limit, which caps UTF-8 at U+10FFFF to match UTF-16, comes from RFC 3629).
> These mappings of numbers to characters are just a convention that someone decided on when ASCII was developed in the 1960s. There’s nothing fundamental that dictates that a capital A has to be character number 65, that’s just the number they chose back in the day.
I don't think it's mere coincidence that the capital letters start at 65 and the lower case at 97 and the decimal digits at 48.
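It isn't a coincidence - the code positions encode structure, all of it visible in the standard ASCII table. A quick Python check:

```python
# Upper and lower case differ by exactly one bit (0x20), so case
# conversion on ASCII text is a single OR/AND/XOR:
assert ord("A") == 65 == 0x41
assert ord("a") == 97 == 0x61
assert ord("A") ^ 0x20 == ord("a")

# Digits start at 48 == 0x30, so the low nibble *is* the digit's value
# (convenient next to binary-coded decimal hardware of the era):
assert ord("0") == 48 == 0x30
assert ord("7") & 0x0F == 7

print("ASCII layout checks pass")
```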
It's not a matter of winning or losing. The pre-Unicode mix of character sets was a mess when it came to internationalization. Try truncating a Japanese Shift-JIS string in C. That will learn you...
Arguably Unicode (UTF-8 and -16) doesn't necessarily make this any easier. Or any variable-length encoding, really. You see halved code points quite frequently, and if not that, then halved " and the like.
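A small Python demonstration of both failure modes (the string is an arbitrary example); it also shows the one mitigation UTF-8 does offer, self-synchronisation:

```python
text = "日本語"

sjis = text.encode("shift_jis")      # 2 bytes per character here
try:
    sjis[:3].decode("shift_jis")     # cut in the middle of the 2nd character
except UnicodeDecodeError as e:
    print("shift_jis:", e.reason)

utf8 = text.encode("utf-8")          # 3 bytes per character here
try:
    utf8[:4].decode("utf-8")         # cut in the middle of the 2nd character
except UnicodeDecodeError as e:
    print("utf-8:", e.reason)

# UTF-8's lead/continuation byte pattern lets a truncator back up to
# the previous character boundary, which Shift-JIS can't do locally:
trunc = utf8[:4]
while trunc and (trunc[-1] & 0xC0) == 0x80:  # drop continuation bytes
    trunc = trunc[:-1]
if trunc and trunc[-1] >= 0xC0:              # drop a dangling lead byte
    trunc = trunc[:-1]
print(trunc.decode("utf-8"))                 # 日
```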
OT and out of curiosity...how do non-native English speakers experience typing/keyboard education? I can barely remember how to make any of the basic accents over the `e` when trying to sound French...are typing classes in non-English schooling systems much more sophisticated than in English (i.e. ASCII-centric) schools? I wonder if non-native English typists come away with a better handling of the power of keyboard shortcuts (whether to create accents or not)
In Korea, almost all keyboards are just plain QWERTY keyboards with Hangul printed on them as well.[0]
The right Alt key is used as a Hangul (Korean)/English toggle, and the right Ctrl key for Hanja.
When toggled to Hangul, only the English letters are overridden by Hangul characters. All numbers and symbols are the same as in English typing mode.
Basically there are no additional keys compared to QWERTY.
Is it complex and hard to learn to type in Hangul? Nope. Maybe 'Korean' is complex to learn, but 'Hangul' - I mean, the script? the character composition system? sort of that - is quite simple.[2]
Actually, it's capable of more efficient input layouts than English, especially in more restricted environments, like the basic cell phone key layout (E.161).[1]
There was a King, and he was a really great hacker. Because he was a King, he grabbed a bunch of smart guys from all around the country ; ) and pushed them to work hard. (did I say he was a King?) As a result, he invented many good things for the country's people. Today Korean has its own quite good and expressive characters, and he has earned quite a good place.[3]
In the Netherlands we use the "US International" keyboard, which is basically qwerty like you're used to, but with dead keys to make áéàü etc. We don't have that many odd characters so it's okay not to have special keys for them, but I'd agree with anyone saying that even these few are too many. If we're gonna clean the language up though, let's also get rid of other overhead. For example the sentence "If we clean language though, rid other overhead" makes quite a lot of sense.
Other countries that make even less sense and use way too many special characters (from my blunt perspective) usually have different keyboards. Prime example that I know of is France where you have to use the shift key to make numbers.
To answer what you actually asked: yes, we are taught how to make those special characters. For me it's as logical as typing parentheses or the euro sign, but I'm spending most of my waking hours typing one thing or another. Many people don't, or barely know how to.
It's not what most of the time is spent on in those courses. We're being taught the asdfjkl; row just like everyone else. Or aoeuhtns, depending on your keyboard (in the Netherlands we only have qwerty though). Making accents is more of a side thing that's mentioned once or twice after learning everything else at an acceptable speed.
Chinese has many different input methods. For example, in Taiwan most keyboards have the symbols for 4 different input methods printed on them: English/Pinyin on the top left, Zhuyin on the top right, Cangjie on the lower left, and Dayi on the lower right. The top two are phonetic, and the bottom two are symbolic.
Chinese input methods typically require a sequence of key-presses and then a selection from a menu of matching characters, with the most common matches first. Multi-character sequences can be entered without making a choice until the end, in which case the most likely n-grams come first.
Most people in Taiwan learn phonetic (Zhuyin) in school, which is very easy, as long as you know standard pronunciation.
Japanese often type in "Romaji" (a romanized transliteration of Japanese). Basically, they type phonetically in roman characters and the computer suggests/autocompletes characters which match the phonetics. It's a little like typing SMSs in T9 on old numberpad phones.
I switch (with a key combination) between a French keyboard, when I need to write text in French with accents, and an English keyboard when I need to type various types of brackets e.g. [] and {}. I actually know two different French keyboard layouts ... and can touch type fairly efficiently using any of the three. I don't consider myself a very efficient user of keyboard shortcuts.
Well, when 99% think unicode = encoding = ucs2 = utf-16, don't believe there's something outside BMP, and wtf is the only word coming to their mind when they hear about graphemes… Unicode won?
IPv6 addresses are 128-bit, by the way, not 64.
ASCII had the advantage that an open carrier (all zeros) mapped to NUL, so you didn't waste paper (either tape or roll).
Unicode also would have been too incompatible with existing code that copied 8-bit character strings around. See http://www.cl.cam.ac.uk/~mgk25/ucs/utf-8-history.txt for some rationale behind UTF-8.
http://getpython3.com/diveintopython3/strings.html
So he thinks Americans are the only people to use the English language, does he?
___________
¹ American Standard Code for Information Interchange, http://www.wps.com/projects/codes/X3.4-1963/index.html
² 7-bit Coded Character Set, http://www.ecma-international.org/publications/standards/Ecm...
³ http://www.ecma-international.org/default.htm
⁴ http://www.wps.com/J/codes/Revised-ASCII/index.html
⁵ 7-bit Input/Output Coded Character Set, 4th Edition is unfortunately the oldest available online; http://www.ecma-international.org/publications/files/ECMA-ST...
⁶ Character Code Structure and Extension Techniques, http://www.ecma-international.org/publications/standards/Ecm...
⁷ Extension of the 7-bit Coded Character Set, http://www.ecma-international.org/publications/files/ECMA-ST...
http://www.joelonsoftware.com/articles/Unicode.html
[0] http://i.imgur.com/j0Xk6oY.jpg
[1] http://bit.ly/11CF0mS
[2] http://blog.naver.com/PostView.nhn?blogId=neraijel&logNo=110...
[3] http://i.imgur.com/69jkSXa.jpg
http://en.wikipedia.org/wiki/Keyboard_layout#Taiwan