Interesting article. I had no idea of this phenomenon. It's very well explained too.
Tangential remark: substack is becoming as annoying as medium now. Especially on mobile. One big popup asking to register. One constant toolbar asking to register. One constant toolbar asking to install the app. Many interruptions in the main text for subscribe and share reminders.
That didn't take long for it to go bad :( I only heard about substack about a year ago when Snowden's blog was in the news and they were still saying they'd keep it clean (like medium promised initially as well). And it was pretty clean then.
I was even thinking of putting my own blog there (which is free and unmonetized) but no.
On medium it's become so bad now that I don't even open their links anymore unless it's really something I am so curious about I'm willing to put up with the experience. I really hope substack doesn't go the same way.
Sure they have to make money but alienating your userbase doesn't seem a great way to do so in the long term.
It's not a "Tangential remark" if your tangent is 10 times longer than your commentary on the article!
> Please don't complain about tangential annoyances—e.g. article or website formats, name collisions, or back-button breakage. They're too common to be interesting.
> Tangential remark: substack is becoming as annoying as medium now. [...] On medium it's become so bad now that I don't even open their links anymore.
FWIW I find that Reader Mode works fine for making posts on substack and medium interruption-free, both on desktop (Firefox) and mobile (Safari).
I realise it's possible that either of them might try some Reader Mode defeating hijinks in the future, so you still might want to avoid putting your blog there if you don't want that possibility looming beyond the horizon. But when it comes to reading existing stuff that other people have written, an accessible version should be just one or two clicks away.
> This led me down a rabbit hole of trying to find a font that could render all of the language scripts. I went on Google Fonts to find this perfect font and found that one did not exist.
This seems like a rather odd example of "technological disparity"; it's just modular and "one huge font file" isn't in wide use because it's unwieldy.
Install/add an additional Chinese font, or Tamil font, etc. as needed. Most use cases don't need "all the scripts" and a modular approach is much better as fonts are large: NotoSans-Regular is 590K; NotoSansCJK-Regular is 26M, etc. In total Noto Fonts is 373M on my system.
And it contains hundreds of thousands of different glyphs, and requires expertise on dozens of writing systems. Creating such a font is a significant effort, which is why few "universal" fonts exist, and why there are many "for this script" fonts.
---
I wonder how one could design a better Morse code for Chinese; Japanese Morse code used Kana, rather than Kanji, and Hangul can be composed from smaller blocks. As near as I can find, Chinese is kind of an outlier here. Any system I can think of would probably be equally difficult to use or error-prone (due to either operator error or line noise mangling things).
Written Chinese can be thought of as a syllable alphabet with 100s of ways to write each syllable. For a fluent reader it is easier to read with those contextual hints, but strictly speaking it is not necessary.
Spoken Chinese works just fine without them.
Morse code usually has its own vernacular so it is easy to get around the lack of characters.
> I wonder how one could design a better Morse code for Chinese
In terms of efficiency, I guess you might start by ranking characters by frequency, and build a Huffman code? Then think about adding parity bits or sync symbols or whatnot.
It's hard to imagine people learning it, rather than painstakingly looking up each character in a book, but I suppose it'd be similar to learning any other numeric mapping. Wikipedia says "Chinese expert telegraphers used to remember several thousands of codes of the most frequent use".
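The frequency-ranked Huffman idea above can be sketched in a few lines. This is a toy, mapping each character to a prefix-free dot/dash sequence; the five characters below are common ones, but the counts are made up for illustration:

```python
import heapq
from itertools import count

def huffman_code(freqs):
    """Build a prefix-free dot/dash code; frequent symbols get shorter codes."""
    tiebreak = count()  # prevents comparing dicts when frequencies tie
    heap = [(f, next(tiebreak), {sym: ""}) for sym, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        # Prepend the branch symbol to every code in each subtree.
        merged = {s: "." + c for s, c in left.items()}
        merged.update({s: "-" + c for s, c in right.items()})
        heapq.heappush(heap, (f1 + f2, next(tiebreak), merged))
    return heap[0][2]

# Hypothetical frequency counts, not real corpus statistics.
codes = huffman_code({"的": 100, "一": 80, "是": 60, "不": 40, "了": 20})
```

As the comment suggests, you'd still want parity or sync symbols on top, since one flipped element can desynchronize any variable-length code.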
I actually ran into this myself when I was testing my C program for UTF-8 support. Ended up just installing Noto CJK, but I wondered why there wasn't a universal font I could use. Now I'm wondering why I can't just fuse a font that has CJK chars with a font that has Latin ones.
Missing glyphs would be a much smaller problem if people stopped creating/using text renderers that don't support font fallback and the system font store. There are still a good number of GUI frameworks that assume one font, or even one Latin font, is enough. Fyne (Golang) is an example; the advice, if you want to display CJK text, is to bundle a single font with all the glyphs you need…
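The fallback logic those frameworks skip is small in principle: walk a font chain and pick the first font whose coverage includes the codepoint. A toy sketch, where the coverage predicates are hypothetical stand-ins (a real renderer queries each font's cmap table):

```python
# Hypothetical coverage predicates; real code reads the fonts' cmap tables.
FONT_CHAIN = [
    ("NotoSans-Regular", lambda cp: cp <= 0x036F),               # Latin + diacritics (assumed)
    ("NotoSansCJK-Regular", lambda cp: 0x4E00 <= cp <= 0x9FFF),  # CJK Unified Ideographs
]

def pick_font(ch, chain=FONT_CHAIN):
    """Return the first font in the chain claiming coverage of ch, else None."""
    cp = ord(ch)
    for name, covers in chain:
        if covers(cp):
            return name
    return None  # no coverage anywhere: the renderer draws a "tofu" box
```

With this design, adding Tamil support is just appending one more (font, coverage) entry to the chain, which is the modular approach the earlier comment describes.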
Languages are subject to Metcalfe's Law. In the long run - let's say a thousand years - it's likely that the majority of humans will speak a single language. And that language will most likely be derived from English.
The only thing that kept languages separate, historically, was isolation. The internet has fixed that. If you want to publish content to the widest audience, you publish in English. If you want to consume that knowledge, you'd better understand English. The network effect is powerful and English has a substantial lead.
Maybe it's the seed planted by the British empire. Maybe it's the fact that English fits into 7-bit ASCII. Maybe it's the fact that English is already a hodge-podge of Germanic and Romance languages. Maybe it has something to do with English readily adopting neologisms from other languages. Whatever the historical reasons, if right this second you put a random group of people from different non-English-speaking countries together, they're probably going to talk to each other in English.
So yeah, this problem - if you think it's a problem - is going to get worse over time. But there's nothing you can do about it in the long run.
From what I know, nobody in China cares about English, and that's a nation 2-3x bigger than the entire population of English-speaking countries :) I can't see a demise of Chinese even after 1,000 years, nor of other languages with a big enough user base (I'm not speaking about smallish European countries).
This is no surprise, the tokenizer is constructed to minimize the encoded length of the training corpus, and most of that is in English or at least using the Latin alphabet.
This is probably not entirely invalidating the result, but the language samples in the dataset seem to be extremely badly translated from English, with unnatural, verbose and grammatically wrong sentences. That would not help with good tokenisation.
For example, the English text
> please add milk to the grocery list
is compared to the French text
> s'il vous plaît ajouter du lait à la liste d' épicerie
But a native would say
> veuillez ajouter du lait à la liste de courses
This is a really good point! I also noticed that some of the translations were not good, or very stilted, for the languages I do speak. However, this is a limitation of a dataset of this size and breadth.
I wonder how humans decode these symbols, because it doesn’t seem to be 10x more “costly” for a person to natively learn one of those languages vs another
> to express the same sentiment, some languages require up to 10 times more tokens
This does not make sense, does it? It may be a true metric as per the setup of the comparison (the existing models, the existing corpus, etc.), but logically it seems an artifact rather than something deep about language information density.
Nevertheless, it seems worth investigating. I would suspect that once various irrelevant biases are removed (a sort of ur-LLM) there will be an interesting comparative landscape.
When talking about token length, I couldn't help but wonder if they were judging length in UTF-8 byte size, in which case languages using non-Latin alphabets (and even those that do, but with accents) would pay a penalty.
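For what it's worth, the post measures tokens rather than raw bytes, but there is a real byte penalty underneath: byte-level tokenizers (like GPT-2's BPE) start from UTF-8, where non-Latin scripts cost 2-3 bytes per character:

```python
# Characters vs. UTF-8 bytes for a few scripts.
samples = ["hello", "héllo", "привет", "你好", "မြန်မာ"]
for s in samples:
    print(f"{s!r}: {len(s)} chars, {len(s.encode('utf-8'))} bytes")
```

So "你好" is 2 characters but 6 bytes, and Burmese text is 3 bytes per codepoint before the tokenizer even sees it.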
Most LLMs are trained using subword tokenization such as BPE (which is investigated in this post) and SentencePiece.
These algorithms minimize the number of tokens required to represent the training corpus, so for a training set consisting mostly of English this is a natural consequence.
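A toy version of that BPE training loop makes the bias mechanical: merges are chosen purely by pair frequency in the corpus, so whatever language dominates the corpus gets the long, cheap tokens. The tiny corpus below is hypothetical:

```python
from collections import Counter

def learn_bpe(corpus, num_merges):
    """Greedy BPE: repeatedly merge the most frequent adjacent symbol pair."""
    # Each word starts as a tuple of single characters.
    words = Counter(tuple(w) for w in corpus.split())
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for word, freq in words.items():
            for pair in zip(word, word[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]
        merges.append(a + b)
        merged_words = Counter()
        for word, freq in words.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == (a, b):
                    out.append(a + b)  # apply the new merge
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            merged_words[tuple(out)] += freq
        words = merged_words
    return merges

# Mostly-English corpus: the merges it learns are English fragments.
print(learn_bpe("the the the the cat cat le chat", 2))
```

Run the same loop over a mostly-Burmese corpus and it learns Burmese merges instead; the tokenizer itself has no preference, only the corpus does.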
In my opinion a more interesting question would be to ask if chatGPT performs better or worse on languages with "unique" sets of characters (like Burmese & Amharic?) compared to other European languages (like French & German) which might tokenize to shorter lengths but share subwords with English while having different meanings.
Also, OpenAI being an American company and training a model which is optimised for English seems very natural... Just query it in English for better and cheaper results. If it would be equally good at 200 different languages it would probably be bad at all of them instead.
I’ve talked to folks whose native language is represented on the right side of the distribution and GPT4 performs poorly both in terms of speed as well as language facility for these languages. Interestingly the right side tends to be largely south east and south Asian languages. Malay is an outlier in that it tokenizes fairly small. But Burmese, Khmer, and Thai perform poorly.
I’ve tried using chatgpt on Tamil and it’s firstly hella slow and secondly can’t do much with it beyond a few hundred words. I figured it’s considering each letter in the sentence as a token, but originality and inventiveness wise it wasn’t necessarily worse.
Curious to know if it’s more than just a representational problem. Are some languages harder in some deep way than others? And with what human consequences, let alone the LLM costs?
I don't think languages are significantly easier or harder; there's a limit on how hard it can get for the average person to be able to speak it comfortably. I think there are different tradeoffs, and if you come from a language with similar tradeoffs it's easier to learn.
My native language is Polish, I know English and a little German and Spanish. Slavic languages often have the reputation of being difficult, but IMHO that's because they are easy in places English-speakers expect to be difficult, and difficult in places English-speakers expect to be easy.
There are 15 tenses in English and 3 in Polish. There are no articles in Polish, and the pronunciation is almost perfectly regular. And there are probably 20 times fewer word roots, because of the pre/post-fix system. What in English is 20 unrelated words is in Polish one word root + 20 different combinations of pre/post fixes :)
But to take advantage of this when you're learning you have to think in the language you are learning - to realize these words are related and how the postfixes modify the meaning. Otherwise you'll still need to memorize 20 separate words - and on top of that all that crap that is harder in Polish, like cases.
I wonder if this influences LLMs (for example if they "think in Polish" when producing Polish text, or "think in English" and translate on the fly). I noticed GPT-3 was much better at rhyming in English than in Polish, despite the fact that rhyming in Polish is very easy (if the final letters match - it rhymes). When I explained this rule to it - it started rhyming better :)
Some languages have some specifics that are not common in other languages, and if the language is small, it's hard to account for that too.
In slovene, you have singular, plural but also dual forms, so even the basic "strings.xml" types of localizations don't work:
e.g. "I eat" would be "Jaz jem"; if it were two of us, "We eat" would be "Midva jeva"; and if 3+ of us were eating, it would be "Mi jemo".
Also, when counting, we have a different form for one thing, two things, three-or-four things and five+ things. So "(1-5) beer/s" would be "1 pivo, 2 pivi, 3 piva, 4 piva, 5 piv".
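Those four forms don't fit a two-form `strings.xml`, but they are still mechanical. A sketch of a selector using the forms from the comment above; the mod-100 wrap-around follows CLDR's Slovenian plural rules (so e.g. 101 behaves like 1):

```python
def slovene_form(n, one, two, few, other):
    """Pick the Slovene plural form for integer n (CLDR-style rules)."""
    n100 = n % 100
    if n100 == 1:
        return one       # singular
    if n100 == 2:
        return two       # dual
    if n100 in (3, 4):
        return few       # three-or-four form
    return other         # five-plus form

beers = [f"{n} {slovene_form(n, 'pivo', 'pivi', 'piva', 'piv')}" for n in range(1, 6)]
print(beers)
```

Localization systems that only offer "one/other" slots can't express this, which is exactly the comment's point; you need a per-language plural-rule engine.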
No. It's because OpenAI's token model is highly biased towards English words. In this way, other languages are greatly disadvantaged, reducing competitiveness in non-English-speaking countries. In my languages (Portuguese and Spanish), for example, my cost increases by 60% to 75% because the content I need to process is not in English.
No it's not. Greek is tokenized on a per-character level, while the language structure, while not as simple as English or Latin languages in general, is not that different.
I have a suspicion that LLMs have a primary language that they must always be consistent with. If so, it makes sense that they perform worse in non-native languages.
I believe it comes down to the statistical frequency in the dataset used to build the tokenizer. If the dataset is 90% Burmese the resulting tokenizer will have single tokens representing common complex constructs in Burmese instead of English. Today's tokenizer might tokenize Burmese letter-by-letter or even byte-by-byte as it is not a very common language on the internet.
How about parsing difficulty? Some languages are more context dependent than others.
In programming, C++ takes probably 10x more cycles to compile than simpler languages. There are so many possible interpretations of each statement, the correct one of which depends on context.
It's interesting that the token length for Chinese is only slightly longer than for English. What does tokenization of an ideographic language look like, anyway? One token per ideograph? Something else?
Either one token per radical (as some minimalist proposals for CJK Unicode normalization suggested way back when) — or just map everything to sense-annotated pinyin, i.e. what you type into a Chinese IME to get ideographs out (and which is also, I think, what Chinese text-to-speech engines do internally as an intermediate step.)
Interesting. This sort of thing could be solved by having a transformer work with intermediate and language-less concepts/ideas and then translating to/from a language with an additional model/encode/decoder separately.
I think the impact of ensuring "fairness" with language models at this point in time of their development would be quite negative. Does every model need to support Burmese? How large does a company offering a model have to be before it's considered a requirement? OpenAI's homepage (https://openai.com/) only seems to be available in English, and these transformers are quite new. Why haven't we applied the same logic of fairness & inclusivity to every site on the internet? Because it's infeasible unless usage calls for it. I don't believe it's political.
Just on the face of it, as someone who has implemented a handful of tokenizers over time and can read Hangul, the Korean example there is four tokens, not twelve. You can break that down more with semantic parsing, but tokenization is not semantic parsing.
Can't this be explained by the special characters and the extra step to translate them? For example, "Diccionario Español" changes to xn--diccionarioespaol-txb in a domain, so the extra step adds a level of complexity/compute.
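That extra step is IDNA/Punycode, and Python's stdlib codec shows the round-trip (the label is lowercased and the space dropped, since domain labels allow neither):

```python
label = "diccionarioespañol"

# ToASCII: a non-ASCII label becomes "xn--" + Punycode.
encoded = label.encode("idna")
print(encoded)

# ToUnicode: decoding round-trips back to the original label.
assert encoded.decode("idna") == label
```

The transform itself is cheap, though; it's unrelated to why tokenizers spend more tokens on non-Latin text.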
> Sure they have to make money
https://www.theverge.com/2023/4/7/23674178/substack-burn-202...
> Hangul can be composed from smaller blocks
Not “can be”, “is”. Hangul was designed; its alphabetic and morpho-syllabic features were intentional.
> it's likely that the majority of humans will speak a single language. And that language will most likely be derived from English.
2000 years ago, it would have been Sanskrit. Predicting something 1000 years into the future is tricky business. It takes one calamity/war/etc. to tilt the balance.
If Yellowstone erupts in 200 years, it's unlikely English would remain the dominant language. Maybe Mandarin, who knows.
Also, supposedly, independent of the language, humans communicate at an approximately constant rate (39 bits/second according to this: https://www.science.org/content/article/human-speech-may-hav...)
—
Here's the Twitter thread about it from the same author:
https://twitter.com/yenniejun/status/1653791622197579776?s=4...