Not sure if the l33tspeak analogy is fully justified.
In the case of the "missing" letter (called khanda-ta in Bengali) in the Bengali word for "suddenly": historically it was a derivative of the ta-halant form (ত + ্ + joiner). As the language evolved, khanda-ta became a grapheme in its own right, and Unicode 4.1 did encode it as a distinct character. A nicely written review of the discussions around the addition can be found here: http://www.unicode.org/L2/L2004/04252-khanda-ta-review.pdf
I could write the author's name fine: আদিত্য. A search for that string on the Bengali Wikipedia pulls up quite a few results as well, so other people are writing it this way too. The final "letter" in that string is a compound character, and there's no clear evidence that it needs to be treated as an independent one. Even in primary school, we were taught the final "letter" in the author's name as a conjunct. In contrast, in the khanda-ta case it could be shown that modern Bengali dictionaries explicitly refer to khanda-ta as an independent character.
For me, many of these problems are more of an input issue than an encoding issue. Non-Latin languages have had to shoehorn their scripts onto keyboard layouts designed for Latin scripts, and that has always been suboptimal. With touch devices we have new ways to think about this problem, and people are starting to experiment.
[Disclosure: I was involved in the Unicode discussions about khanda-ta (I was not affiliated with a consortium member) and I have been involved with Indic localization projects for the past 15 years]
Author here.

Well, yes and no. The jophola at the end is not actually given its own codepoint[0]. The best analogy I can give is to a ligature in English[1]. The Bengali fonts you have installed happen to render it as a jophola, the way some fonts happen to render "ff" as the ligature "ﬀ", but that's not the same thing as saying that it actually is a jophola (according to the Unicode standard).

The difference between the jophola and an English ligature, though, is that English ligatures are purely aesthetic. Typing two "f" characters in a row has the same obvious semantic meaning as the ﬀ ligature, whereas the characters that are required to type a jophola have no obvious semantic, phonetic, or orthographic connection to the jophola.

[0] http://unicode.org/charts/PDF/U0980.pdf

[1] Some fonts will render (e.g.) two "f"s in a row as if they were a ligature, even though it's not a true ﬀ (U+FB00).
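To make the ligature analogy concrete, here is a minimal Python 3 sketch (standard library unicodedata only) showing that "ﬀ" is a separate code point whose relationship to plain "ff" is recorded as a compatibility decomposition, i.e. purely presentational:

    import unicodedata

    ligature = "\ufb00"   # 'ﬀ' LATIN SMALL LIGATURE FF
    two_effs = "ff"       # two ordinary 'f' characters

    print(unicodedata.name(ligature))                            # LATIN SMALL LIGATURE FF
    print(ligature == two_effs)                                  # False: distinct code points
    print(unicodedata.normalize("NFKC", ligature) == two_effs)   # True: compatibility-equivalent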
> For me, many of these problems are more of an input issue than an encoding issue.
I think you've hit the nail on the head. I'm a native English speaker, so I may be making bad assumptions, but I think the biggest issue is that people conflate text input systems with text encoding systems. Unicode is all about representing written text in a way that computers can understand. But the way a given piece of text is represented in Unicode bears only a loose relation to the way the text is entered by the user and the way it's rendered on screen. A failure at the text input level (you can't input text the way you expect) or a failure at the text rendering level (text as written doesn't render the way you expect) is quite distinct from a failure of Unicode to accurately represent text.
FWIW there is an interesting project called Swarachakra[1] which tries to create a layered keyboard layout for Indic languages (on mobile) that's intuitive to use. I've used it for Marathi and it's been pretty great.

They also support Bengali, and I bet they would be open to suggestions.

[1]: http://en.wikipedia.org/wiki/Swarachakra
Wait, "ত + ্ + joiner = ৎ" is nothing like "\ + / + \ + / = W".
The Bengali script is (mostly) an abugida, i.e. consonants have an inherent vowel (/ɔ/ in the case of Bengali), which can be overridden with a diacritic representing a different vowel. To write /t/ in Bengali, you combine the character for /tɔ/, "ত", with the "vowel-silencing diacritic" that removes the /ɔ/, "্". As it happens, for "ত" the addition of the diacritic changes the shape considerably more than it usually does, but it's perfectly legitimate to suggest the resulting character is still a composition of the other two (for a more typical composition, see "ঢ" + "্" = "ঢ্").
As it happens, the same character ("ৎ") is also used for /tjɔ/, as in the "tya" of "aditya", which suggests having a dedicated code point for it could make sense. But Unicode isn't being completely nutso here.
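For the curious, here is a small Python 3 sketch (standard library unicodedata) of the distinction under discussion: the ta + hasanta sequence is two code points, while khanda-ta has had its own code point since Unicode 4.1, and normalization does not turn one into the other:

    import unicodedata

    ta = "\u09a4"         # ত BENGALI LETTER TA
    hasanta = "\u09cd"    # ্ BENGALI SIGN VIRAMA (the vowel silencer)
    khanda_ta = "\u09ce"  # ৎ BENGALI LETTER KHANDA TA (added in Unicode 4.1)

    print([unicodedata.name(c) for c in ta + hasanta])
    print(unicodedata.name(khanda_ta))
    # The two-code-point sequence is not canonically equivalent to the single letter:
    print(unicodedata.normalize("NFC", ta + hasanta) == khanda_ta)   # False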
Native Bengali here. The ligature used for the last letter of "haTaat" (suddenly) is not the same as the last ligature in "aditya" - the latter doesn't have the circle at the top.
More generally, using the vowel-silencing diacritic (hasanta) along with a separate ligature for the vowel ending, while theoretically correct, does not work because no one writes that way! Not using the proper ligatures makes the text essentially unreadable.
I'm a bit confused too. There's a principled argument in Unicode for when a glyph gets its own codepoint versus when it's considered sufficient to use a combining form. I don't know Bengali at all, so I can't comment on this case, although given that the character is now in Unicode, I guess the argument changed over time. Somewhere buried in the Unicode Consortium notes is an explicit case for the inclusion/exclusion of this character; it'd be interesting to find it.
I don't understand Bengali at all, but the character ৎ does have its own Unicode codepoint (U+09CE, BENGALI LETTER KHANDA TA). It was introduced in Unicode 4.1 in 2005.
I get the author's point, but assigning blame to the Unicode Consortium is incorrect. The Indian government dropped the ball here. They went their separate way with "ISCII" and other such misguided efforts, instead of cooperating with UC. To me, the UC is just a platform.
The government is the de facto safeguarder of the people's interests; if it drops the ball, it should be taken to task, not the provider of the platform.
There are over 100 languages spoken in India, many of which are not even Hindustani in origin. The Indian government primarily recognizes two languages: Hindi and Urdu. Additionally, Urdu is the official national language of Pakistan.
Are speakers of Bengali, Tamil, Marathi, Punjabi, etc. really supposed to depend on the Indian government to assure that the Unicode Consortium supports their native tongues? What about speakers of Hindustani/Indo-Iranian languages who do not live in India?
I don't think it's right to pin the blame on India and tell Bengali speakers that they're barking up the wrong tree. If the Unicode Consortium purports to set the standard for the encoding of human speech, then it seems to me that the responsibility should fall squarely on them.
This is a terrible excuse: the Unicode Consortium should always seek out at least one (if not a group of) native speakers of a language before defining code points for that language. These speakers really should be both native speakers of and experts in the language.
There are countless ways to reach out to Bengali speakers, only one of which is the Indian government - whatever politics governments may play, a technocratic institution should be focused on getting things correct.
> The Indian government dropped the ball here. They went their separate way with "ISCII" and other such misguided efforts, instead of cooperating with UC
According to the beginning of chapter 12 of the Unicode Standard 7.0:
"The major official scripts of India proper [...] are all encoded according to a common plan, so that comparable characters are in the same order and relative location. This structural arrangement, which facilitates transliteration to some degree, is based on the Indian national standard (ISCII) encoding for these scripts. The first six columns in each script are isomorphic with the ISCII-1988 encoding, except that the last 11 positions, which are unassigned or undefined in ISCII-1988, are used in the Unicode encoding."
I am an Indian, and it shocks me that Indians are still blaming the British after 70 years of independence.

Is 70 years of independence not enough to make your language a "first-class citizen"? Of course Bengali is a second-class language, because Bengalis didn't invent the standard.

Can we stop blaming white people for everything? Seriously, WTF.
I wonder if the author has submitted a proposal to get the missing glyph for their name added: http://unicode.org/pending/proposals.html

You don't need to be a member of the consortium to propose adding a missing glyph or updating the standard. The point of the committee, as I understand it, isn't to be an expert in all forms of writing, but to take the recommendations from scholars/experts and get to a working implementation, though more diverse representation of language groups would definitely be a positive change.

Also, the CJK unification project sounds horrible.
The author's explanation of what characters Chinese, Japanese, and Korean share is very limited. All three languages use Chinese characters in their written forms to varying extents, and in some cases the divergences date back less than a century. There are cases where a Chinese character as written in Japanese differs from its Traditional Chinese form (e.g. 国, Japanese version only because I don't have Chinese installed on this PC), which may differ again from its Simplified Chinese form, but there are also many instances where the character is identical across all three languages (e.g. 中). Although I am not privy to the specifics of the CJK unification project, identifying these cases and using the same character for them doesn't sound unreasonable.
Edit- To be clear, Korean primarily uses Hangul, which basically derives jack and shit from the Chinese alphabet, and Japanese uses a mixture of the Chinese alphabet, and two alphabets that sort-of kind-of owe some of their heritage to the Chinese alphabet, but look nothing like it. If they are talking about unifying these alphabets, then they are out of their minds.
I found it extremely annoying that he doesn't even say specifically what the problem with his name is. So there's a letter that's unavailable? Which letter?
Holy cow, CJK unification is a terrible idea. Maybe if it originated from the CJK governments it might be an OK idea, but the idea of Western multinationals trying to save Unicode space by disregarding the distinctness of a whole language group is idiotic.
The fundamental role of an institution like the Unicode Consortium is to be descriptive, not prescriptive. If there is a human script passing certain, low, low thresholds, it should be eligible to be included in its distinct whole in Unicode.
This article is imbued with its own form of curious racism. In particular, I became suspicious of its motives at the line:
> "It took half a century to replace the English-only ASCII with Unicode, and even that was only made possible with an encoding that explicitly maintains compatibility with ASCII, allowing English speakers to continue ignoring other languages."
ASCII compatibility was essential to ensure the adoption of Unicode. It's not that English speakers wanted to ignore anyone or anything; it's that without it, Unicode would never have been adopted, and we'd be in a much worse position today.
In other words, the explanation is technical, not racial.
Things like this:
"No native English speaker would ever think to try “Greco Unification” and consolidate the English, Russian, German, Swedish, Greek, and other European languages’ alphabets into a single alphabet."
The author seems to ignore that different European languages used different scripts until quite recently. For example, Gothic (blackletter) and other scripts were used in lots of books.

I have old books that take something like a week of practice before I can read them quickly, and they are in German!

But it was chaos, and it unified into a single standard. Today you can open any book, Greek, Russian, English or German, and they all use the same standard style of script, although they include different glyphs. There is a convention for every symbol. In Cyrillic you see an "A" and an "a".

In fact, any scientific or technical book includes Greek letters as a matter of course.

It should also be pointed out that today's Latin characters are not the original Latin, but modified Latin. E.g. lowercase letters did not exist in the Roman Empire; they were added by other languages and "unified".
About CJK, I am not an expert, but I have lived in China, Japan and Korea, and in my opinion it has been pushed by the governments of those countries because it has lots of practical value for them.

Learning Chinese characters is difficult enough. If they don't simplify it, people just won't use it when they can use Hangul or kana. With smartphones and tablets, people there are not handwriting Chinese anymore.
It makes no sense to name yourself with characters nobody can remember.
Right, I was going to argue against that too. Changing fonts with every letter will always look weird. There are a lot of different ways to shape these letters that are all counted as the same.
> Learning Chinese characters is difficult enough. If they don't simplify it people just won't use it
CJK unification doesn't make learning Chinese characters any easier, it only means that characters that look exactly the same will use the same code point. This is not related to simplifying characters.
Also, I regularly see people hand-writing Chinese characters on tablets, so it does happen. And for those who use Pinyin to enter characters, it doesn't matter if the characters are simplified or traditional or unified or whatever, because they only have to pick the right one in their IME.
The poo emoji is not part of Unicode because rich white old cis-gendered mono-lingual oppressive hetero men thought it'd be a fun idea (outrage now!!).
Emoji were adopted from existing cellphone "characters" in Japan, and Japan is famously lagging in Unicode adoption because some Japanese names cannot (could not?) be written in Unicode. It all just seems to be very normal (=inefficient, quirky) design by committee.
The comparison to the play "My Fair Lady" is not very convincing; I would suggest the author remove it, as it weakens the argument. First, the fictional character says 'no fewer than', an admission that there are more. Second, even if we consider this a valid complaint, the author himself points out that this 'common sentiment' comes from a play written a century ago. Using a fictional character from a time when the world's knowledge of language was far smaller than it is today does not help the author's case.
Even though I'm not a native English speaker and couldn't write my name in ASCII, I really despise Unicode.

It's broken technically and a setback socially.

Unicode itself is such an unfathomably huge project that it's impossible to do it right: too many languages, too many weird writing systems, and too many ways of doing mathematical notation on paper that can't be expressed.

Just look at the code pages; they are an utter mess.
Computers and ASCII were a chance to start anew, to establish English as a universal language, spoken by everybody.

The pressure on governments that wanted to partake in the digital revolution would have forced them to introduce it as an official secondary language.

Granted, English is not the nicest language, but it is the best candidate we have in terms of adoption and relative simplicity (Mandarin is another contender, but several thousand logograms are really impractical to encode).

Take a look at the open source world, where everybody speaks English and collaborates regardless of nationality.

One of the main reasons this is possible is that we found a common language, forced on us by the tools and programming languages we use.

If humanity wants to get rid of wars, poverty and nationalism, we have to find a common language first.

A simple encoding and universal communication are a feature; fragmented communication is the bug.
Besides, UTF-8 is broken because it doesn't allow for constant-time random character access and length counting.
Why do you think English is the best candidate for a universal language, and how do you define simplicity? First of all, pronunciation and spelling are almost unrelated, and you have to learn them separately. That results in really different accents throughout the world. Even if you look at AmE and BrE, they differ a lot at the word level. Which one do you want to choose? Besides, personally I find English really ambiguous, and the density of idioms in average text off-putting, although that's only a subjective opinion.

English's use of the Latin alphabet seems like a plus, but there's at least one language that uses that simple alphabet better.
> Besides, UTF-8 is broken because it doesn't allow for constant-time random character access and length counting.

And why would you want that? And how do you define length? Are you a troll?
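To make the "how do you define length?" question concrete, here is a small Python 3 sketch giving three different answers for the same short Bengali string; none of them is obviously the one true "length":

    s = "\u09a4\u09cd\u09af"  # ত + ্ + য, which renders as the single conjunct ending আদিত্য

    print(len(s))                       # 3 code points
    print(len(s.encode("utf-8")))       # 9 bytes in UTF-8
    print(len(s.encode("utf-16-le")))   # 6 bytes in UTF-16
    # What a reader would call "one letter" is a grapheme cluster; counting those
    # needs Unicode segmentation (e.g. the third-party regex module's \X).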
I don't understand; I don't feel like character combination using the zero-width joiner is on the same level as 13375p34k. It sounds like the character just doesn't have a separate code point but is instead a composite, yet is still technically "reachable" from within Unicode, no?
That's effectively the only way to write several Indian languages with Unicode: ZWJ + ZWNJ and a decent font which supports all the ligatures.
The recent Bullshit Sans font is a clear example of how ligatures work. And in Malayalam there are consonant patterns which do not map onto the Sanskrit model, which makes it rather odd to write half-consonants which are full syllables (വ്യഞ്ജനം + ് + ZWJ).

And my name is written with one of those (ഗോപാൽ), and I can't say I'm mad about it, because I found Unicode to be elegant in another way.
Somewhere in the early 2000s, I was amazed to find out that the Unicode layouts for Malayalam as bytes in UTF-8 were sortable as-is.
As a programmer, I found that mapping from encoding to sort order very fascinating, as it meant I had to do nothing special to handle Malayalam text in my programs: the collation order is implicit, provided everyone treats the ZWJ and ZWNJ consistently when sorting.
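That property is by design: UTF-8 was laid out so that sorting strings byte-by-byte gives the same order as sorting by code point. A quick Python 3 check (the Malayalam words below are arbitrary examples, not a curated list):

    words = ["കേരളം", "ഗോപാൽ", "അമ്മ", "മലയാളം"]

    by_codepoint = sorted(words)                                    # Python compares strings by code point
    by_utf8_bytes = sorted(words, key=lambda w: w.encode("utf-8"))  # byte-wise comparison of the UTF-8 encoding

    print(by_codepoint == by_utf8_bytes)   # True: UTF-8 byte order preserves code point order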
It's like typing ` + o to get "ò", isn't it? You can argue that ò is actually an o with that grave accent, while that character is not ত + ্ + an invisible joining character, but that's an input method thing, and there is a ৎ character after all.
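For what it's worth, "ò" really can be written either way in Unicode, and normalization treats the two forms as canonically equivalent; a minimal Python 3 sketch:

    import unicodedata

    precomposed = "\u00f2"   # ò LATIN SMALL LETTER O WITH GRAVE
    combining = "o\u0300"    # 'o' followed by COMBINING GRAVE ACCENT

    print(precomposed == combining)                                # False: different code point sequences
    print(unicodedata.normalize("NFC", combining) == precomposed)  # True
    print(unicodedata.normalize("NFD", precomposed) == combining)  # True

The Bengali case differs in that the ta + hasanta sequence and khanda-ta are not canonically equivalent, which is part of why the dedicated code point mattered.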
Most devanagari glyphs don't have their own codepoint. Marathi/Hindi/Sanskrit (which use devanagari) have a bunch of basic consonants and some vowels (which can appear independently or as modifiers). All the glyphs are logically formed by mixing the two, so the glyph for "foo" would be the consonant for "f"[1] plus the vowel modifier for "oo". When typing this in Unicode, you would do the same, type फ then ू, creating फू.
It gets interesting when we get to consonant clusters. As mentioned in [1], the consonants have a schwa by default, so the consonant for s followed by the consonant for k with a vowel modifier for the "y" sound would not be "sky", but instead something like "səky" (suh-ky).
So how do we write these? We can do this in two ways. One way is to use the no-vowel modifier, which looks like a straight tail[3] (on स, the consonant for "s", or "sə", the tail looks like स्), and follow that by the other consonant. So "sky" would be स् कै [2]. This method is rarely used, and the straight-tail is only used when you want to end a word with a consonant[4].
The more common way of writing multiple consonants is to use glyphs known as consonant clusters or conjuncts[5]. For "sky", we get स्कै, which is a partial glyph for स stuck to the glyph for कै. For most clusters you can take the known "partial" form of the first glyph and stick it to the full form of the second glyph, but there are tons of exceptions, e.g. द+द=द्ध, ह+म=ह्म (the second character was broken), and whatnot. See http://en.wikipedia.org/wiki/Devanagari#Biconsonantal_conjun... if you want a full table.
There aren't individual Unicode codepoints for this, not even codepoints for the straight-tail form of the consonants. I typed स्कै as स्+कै which was itself typed as स + ् + क + ै. This isn't an irregular occurrence either, consonant clusters (with a vowel modifier!) are pretty common in these languages[6].
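As a concrete illustration of the paragraph above, here is a minimal Python 3 sketch listing the code points behind the single on-screen conjunct स्कै (names from the standard unicodedata module):

    import unicodedata

    skai = "\u0938\u094d\u0915\u0948"   # स + ् + क + ै, rendered as the conjunct स्कै

    for ch in skai:
        print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}")
    # U+0938  DEVANAGARI LETTER SA
    # U+094D  DEVANAGARI SIGN VIRAMA
    # U+0915  DEVANAGARI LETTER KA
    # U+0948  DEVANAGARI VOWEL SIGN AI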
I personally don't see anything wrong with having to use combining characters to get a single glyph. If it's possible and logical for it to be broken up that way, it's fine. With this trick, it's possible to represent Devanagari as a 128-codepoint block (http://en.wikipedia.org/wiki/Devanagari_%28Unicode_block%29), including a lot of the archaic stuff. It's either that, or you make a character for every combined glyph there is, which is a lot[7]. One could argue that things like o-umlaut get their own codepoint while स्क doesn't, but o-umlaut is one of maybe 10 such characters for a given European language, whereas स्क is one of around 700 (and that number is being conservative about the "usefulness" of the glyphs, see [7]).
The article is never quite clear about which glyph Aditya finds lacking for his name (sadly, I don't know Bengali, so I can't figure it out), but from the comments it seems like it's something which can be input in Unicode, just not as a single character. That's okay, I guess. And if it's not showing up properly, that's a fault of the font. (And if it's hard to input, a fault of the keyboard.)
It becomes a Unicode problem when:
- There is no way to input the glyph as a set of Unicode code points, or
- The input method for the glyph as a set of Unicode code points can also mean and look like something else in some context (fonts can only implement one interpretation, so it's not fair to blame them)
[1]: well, fə, since the consonants are schwa'd by default. Pronounced "fuh" (ish)
[2]: the space is intentional here so that I can type this without it becoming something else, but in written form you wouldn't have the space. Also it's not exactly "sky", but close enough.
[3]: called paimodi ("broken foot") in Marathi
[4]: which is pretty rare in these languages. In some cases however, words that end with consonant-vowel combinations do get pronounced as if they end with a consonant (http://en.wikipedia.org/wiki/Schwa_deletion_in_Indo-Aryan_la...), but they're still written as if they ended with a vowel (this is true for my own name too, the schwa at the end is dropped). Generally words ending with consonants are only seen in onomatopoeia and whatnot.
[5]: called jodakshar ("joined word") in Marathi
[6]: Almost as common as having two side by side consonants in English. We like vowels, so it's a bit less common, but still common.
[7]: technically infinite, though that table of 700-odd biconsonantal conjuncts would contain all the common ones (assuming we still have the vowel modifying diacritics as separate codepoints), and adding a few triconsonantal conjuncts would represent almost all of what you need to write Marathi/Hindi/Sanskrit words. It wouldn't let you represent all sounds though, unless you still have the ् modifier, in which case why not just use that in the first place?
Getting rid of CJK unification would better model actual language change in the future (France, for instance, has a body that keeps a rigorous definition of the French language up to date; I would enjoy giving them a subset of the code points to define how to write French).
But the general principle sounds odd. Should 家, the simplified Chinese character, and 家, the traditional Chinese character, have different codepoints? Should French not be written using the lower, "English" code points just because it needs a couple of extra characters? Should Latin be written using a whole new set of code points even though it needs no code points not contained in ASCII?
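For reference, Unicode already keeps simplified and traditional forms apart when they are genuinely different characters; unification only merges regional glyph variants of the same character. A quick Python 3 check (the characters are just illustrative examples):

    print(hex(ord("国")), hex(ord("國")))   # 0x56fd 0x570b: simplified vs. traditional, distinct code points
    print(hex(ord("家")))                   # 0x5bb6: written the same in both, so a single shared code point
    print(hex(ord("中")))                   # 0x4e2d: identical across Chinese, Japanese and Korean text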
There was an academic proposal in the '90s for something called "multicode" (IIRC) that did exactly this: every character had a language associated with it, so there were as many encodings for "a" as there were languages that used "a", and all of them were different; or at least every character was tagged so that the language it "came from" was identifiable.
Fortunately, it never caught on.
The notion that some particular squiggle "belongs" to one culture or language is kind of quaint in a globalized world. We should all be able to use the same "a", and not insist that we have our own national or cultural "a".
The position becomes more absurd when you consider how many versions of some languages there are. Do Australians, South Africans and Scots all get their own "a" for their various versions of English? What about historical documents? Do Elizabethan poets need their own character set? Medieval chroniclers?
Building identity politics into character sets is a bad idea. Unifying as much as practically possible is a good idea. Every solution is going to have some downsides, some of them extremely ugly, but surely solutions that tend toward homogenization and denationalization are to be preferred over ones that enable nationalists and cultural isolationists.
Re: the Han Unification debate that's going on in parallel here,
I think CJK unification makes sense from the Unicode point of view (although if they had to choose again after the expansion beyond 16-bit I doubt they'd bother with the political headache). The problem stems from the fact that only a few high-level text storage formats (HTML, MS Word, etc) have a way to mark what language some text is in. There's no way to include both Japanese and Chinese in a git commit log, or a comment on Hacker News.
Sure you can say "that's just the problem of the software developer!" but that's what we said about supporting different character sets before Unicode, and we abandoned that attitude. Hacker News is never going to add that feature on their own.
What's needed is either a "language switch" glyph in Unicode (like the right-to-left/left-to-right ones) or a layer on top of Unicode that does implement one and gets universally implemented in OSes and rendering systems.
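Today that layer is usually markup (HTML's lang attribute, for example), which is exactly what plain-text contexts like commit logs lack. A small Python 3 sketch of the underlying problem (直 is a commonly cited example; the comments describe rendering expectations, not anything stored in the string):

    # The same code point can be "Japanese" or "Chinese"; nothing in the string
    # itself records which, so a renderer must guess or be told out-of-band.
    japanese = "直"   # expected to render with the Japanese glyph shape
    chinese = "直"    # expected to render with the Simplified Chinese glyph shape

    print(japanese == chinese)        # True: indistinguishable at the code point level
    print(f"U+{ord(japanese):04X}")   # U+76F4 either way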
While it is good to bring awareness to this, we are still growing in this area. In fact, we should applaud the efforts so far; we even have a standard that somewhat works for most of the digital world. Does it need to evolve further? Yes.

I am sure the engineers and multilingual people who stepped up to do Unicode and organize it aren't trying to exclude anyone. Largely it comes down to who has been investing the time and money. It may even be easier to fund and move this along further now; it was hard to fund anything like this before the internet really hit, when we largely relied on corporations to fund software evolution and standards.

In no way should the engineers or the group that got us this far (the UC) be chided or lambasted for progressing us to this step; this is truly a case of no good deed going unpunished.
Generally true, but the problem here is not an issue of bandwidth or racism. Unicode can represent this character, but does so with two codepoints, a technical decision the author doesn't feel is useful. He blames this on the dominance of white people in the work (a questionable assumption, given he didn't link to the extensive list of international liaisons and international standards bodies). The participants in Unicode, including a native Bengali speaker who responded above, considered the argument presented but chose a different path to be consistent with how other characters are treated. The author needs to more carefully distinguish the codepoint, input, and rendering issues raised in his argument.
It would be impossible to do. Even back in 1991, when Unicode was conceived, almost all the encodings in use were ASCII-compatible.

For the languages that would be affected by a Greco-unification, that meant the encodings in use before Unicode contained both the Latin script and their "national" script.

Implementing Greco-unification in Unicode would mean that round-trip lossless conversion (from the origin encoding to Unicode and back into the origin encoding) would be impossible, greatly limiting Unicode's adoption.

No such problem existed with Han characters; in fact JIS X 0208 (the character set used for Shift-JIS) did a very similar thing to Unicode's Han unification.

In the absence of backwards-compatibility problems, I would be in favor of Greco-unification too.
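A sketch of that round-trip requirement in Python 3, using ISO 8859-7 (the pre-Unicode Greek encoding) as an arbitrary example: legacy encodings carried both the Latin and the national script, and converting to Unicode and back has to reproduce the original bytes exactly:

    legacy = "Unicode και ASCII".encode("iso-8859-7")      # Greek mixed with Latin, as legacy bytes
    roundtrip = legacy.decode("iso-8859-7").encode("iso-8859-7")

    print(roundtrip == legacy)   # True: lossless, because Greek and Latin letters map to distinct code points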
I see many comments about Han unification being a bad idea, but I am not seeing any reason why it was such a bad idea. I am from a CJK country, and I find it makes a lot of sense. The most commonly used characters should be considered identical regardless of whether they are used in Chinese, Japanese, Korean or Vietnamese. Sure, there are some characters that are rendered improperly depending on your font, but I don't think that makes Han unification a fundamentally bad idea.
Does it really depend on the font, or does it depend on some language metadata? Having it depend on the font seems stupid, since a font should ideally be able to represent any language whose script is encodable in Unicode.
> He proudly announces that there are ‘no fewer than 147 Indian dialects’ – a pathetically inaccurate count. (Today, India has 57 non-endangered and 172 endangered languages, each with multiple dialects – not even counting the many more that have died out in the century since My Fair Lady took place)
So, how many were there really? At the time, I mean.
I believe the number of "dialects" named in My Fair Lady can be largely explained by the lack of a clear distinction between language and dialect over the years. From [1]:

"There is no universally accepted criterion for distinguishing a language from a dialect. A number of rough measures exist, sometimes leading to contradictory results. The distinction is therefore subjective and depends on the user's frame of reference."

[1]: http://en.wikipedia.org/wiki/Dialect#Dialect_or_language
Getting upset about Henry Higgins's estimation of the number of Indian "dialects" in a play from many decades ago doesn't make sense to me. His character was deliberately portrayed as a regressive lout, and terminology has surely changed in the intervening years.
I think the numbers are somewhat disputed. The People's Linguistic Survey of India says there are at least 780, with ~220 having died out in the last half century.[1] The Anthropological Survey of India reported 325 languages.[2]

The discrepancies are made particularly tricky by the somewhat ambiguous distinction between languages and dialects.

[1] http://blogs.reuters.com/india/2013/09/07/india-speaks-780-l...

[2] http://books.google.com/books?id=VjGdDo75UssC&pg=PA145
The meanings of "language" and "dialect" are surprisingly tied up in politics. Aside from what various folks in India speak, consider...
In China, too, we say that people in different regions speak different dialects: the national standard Mandarin; Shanghainese; Cantonese; Taiwanese; Fukanese; and others. Someone who speaks only one of these languages will be entirely unable to speak to someone who speaks only a different one. I'm friends with a couple, the guy being from Hong Kong and the girl being from Shanghai; at home, their common tongue is English. So in what way can these different ways of speaking be considered mere dialects?
But on the other side of the coin, there are the languages of Sweden and Norway. We like to call these different languages, but a speaker of one can readily communicate with a speaker of the other. Wouldn't these be better considered dialects of the same language? I was recently on vacation in Mexico, and at the resort there was a member of the entertainment staff from South Africa, a native speaker of Afrikaans. She told me that she had recently helped out some Dutch guests who spoke poor English (which is usually the lingua franca when traveling). Apparently Afrikaans and Dutch are so close that she was able to translate Spanish or English into Afrikaans for them, and they were able to understand it through their Dutch. Again, Afrikaans and Dutch seem to be dialects of the same language (and, I think, Flemish as well).
I think the answer is that language is commonly used as a proxy for, or excuse for, dividing nations. So if you want to claim that China is all one nation, you have to claim that those different ways of speaking are just dialects of the same language. Conversely, to claim separate national identities for Norwegians and Swedes, we have to say that those are different languages.
> I Can Text You A Pile of Poo, But I Can’t Write My Name ... by Aditya Mukerjee on March 17th, 2015
What is the glyph missing from this?
I know it's not ideal, but some uncommon glyphs have always been omitted from charsets; for example, ASCII never included Æ (http://en.wikipedia.org/wiki/%C3%86), and it was replaced by "ae" in common usage.

http://en.wikipedia.org/wiki/Hanlon%27s_razor
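A quick Python 3 illustration of that kind of omission: Æ simply has no 7-bit representation, so ASCII-only systems had to transliterate it:

    ch = "Æ"
    print(ord(ch))             # 198: outside the 7-bit ASCII range (0-127)
    try:
        ch.encode("ascii")
    except UnicodeEncodeError as e:
        print("not representable in ASCII:", e)
    print(ch.encode("utf-8"))  # b'\xc3\x86'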
> He proudly announces that there are ‘no fewer than 147 Indian dialects’ – a pathetically inaccurate count.
Wow. How can a country function like this? Is everyone proficient in their native language plus a 'common' one, or are all interactions supposed to be translated inside the same country? Regardless of historical and cultural value, if that's the case, it seems... inefficient.
I do realize that there are more countries like this, but the number of languages seems way too high. I am really curious how that works.