
Why is GPT-3 15.77x more expensive for certain languages?

117 points| rayshan | 3 years ago |denyslinkov.medium.com | reply

149 comments

[+] lukeschlather|3 years ago|reply
I would want to see some data on tokenization for some real-world examples. "Je voudrais une pizza" actually translates more directly to "I would like a pizza", which is 5 tokens. But I also think there's some danger here in that these might be cherry-picked examples. Spanish is a lot more dense than English or French and might tokenize better. (I see "quiero pizza" is 4 tokens, which seems like the right number to me; "quiero" actually contains "I want <present tense>".) You could argue it's 2 or 3 tokens, but 4 seems preferable.

For diacritics in French or Spanish, diacritics are logically characters. I can't think of an example where it's actually useful to split the letter into a different token, but I could see it happening and not being harmful. I do think it's possible French is just weird and just needs more tokens. When I think about how I process French, I probably do treat a pathological example like "Je l'ai aimé" as 3 tokens when I speak it out loud. But I can also see why you would tokenize it as 6 tokens; I'm not sure that's Anglocentrism so much as it's recognizing a complexity difference between French and English writing.

But all this is in contrast to how non-Roman characters are tokenized at the byte level. That just seems bad, and like it's definitely going to make things worse for non-Roman languages. There's no point in having tokens that split characters.
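The byte-level fallback cost is easy to see directly from UTF-8: a character not covered by a learned merge costs one base token per byte. A quick sketch in plain Python (no tokenizer needed):

```python
# One UTF-8 byte = one base token before any BPE merges apply, so
# scripts with multi-byte characters start out several times as
# expensive as ASCII text.
samples = {
    "a (ASCII Latin)":       "a",
    "é (Latin + diacritic)": "é",
    "я (Cyrillic)":          "я",
    "あ (Hiragana)":         "あ",
    "表 (CJK ideograph)":    "表",
}
for label, ch in samples.items():
    print(f"{label}: {len(ch.encode('utf-8'))} byte(s)")
```

So before any merges are learned for a script, a CJK character already costs three times an ASCII letter, and a merge-poor tokenizer can even split those bytes across token boundaries, i.e. mid-character.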

[+] function_seven|3 years ago|reply
> Spanish is a lot more dense than English or French and might tokenize better.

I'm no linguist, so I apologize if I'm misinterpreting this statement. My impression has always been that Spanish is less dense than English, only because in almost all cases, the Spanish version of product instructions is wordier. Look at the back of a shampoo bottle[0] and notice that the Spanish version is either longer, or a smaller font, to fit it all.

[0] https://i.postimg.cc/xd2X5WJN/Ghub-Fo-N11u8jz-Pjj-RDt-W-CGA9...

[+] kouteiheika|3 years ago|reply
Slightly offtopic, but:

> One of the models listed above called NLLB (No Language Left Behind) has been open sourced by Facebook allowing for translation for 200 languages.

It was not. The model's weights are under CC-BY-NC, which certainly motivates commercial entities to not leave those languages behind. /s

[+] adsfoiu1|3 years ago|reply
It was open sourced, just under a non-commercial license.
[+] arthurcolle|3 years ago|reply
Everyone I know ripped those leaked LLaMA models and is using them extensively in "open source" / commercial products. Super unwise, but I'm not sure licensing is actually slowing down progress in this field. And even though I'm sure OpenAI is using alternative methods to make the language stuff work so well, I just wanted to comment on that front.

I wouldn't release a chatbot based on LLaMA 65B, because of the legal issues, I'm not sure others are using the same restraint.

[+] simsla|3 years ago|reply
That's still open sourced.
[+] FredPret|3 years ago|reply
What an interesting aspect I haven't considered before. All the AIs will be trained on the available media - most of which is English.

I sometimes wonder what it takes to unseat a lingua franca, but it looks like we won't see that soon. English is set to dominate for a long time.

[+] famouswaffles|3 years ago|reply
Doesn't really matter. There's lots of positive transfer in individual language learning. Competence in one language bleeds into competence in others. https://arxiv.org/abs/2108.13349

GPT-3 is fluent in many languages despite English taking up 93% of the corpus by word count. French is next with 1.8%

https://github.com/openai/gpt-3/blob/master/dataset_statisti...

Dunno the statistics of language presence with GPT-4 but it takes it up another level in terms of its multilingual capabilities.

[+] rapsey|3 years ago|reply
ChatGPT speaks a ton of languages, and very well at that. Hell, it is better at my native language than I am, and I am from a pretty small country.
[+] hirundo|3 years ago|reply
This may be backwards. When AI can cheaply, quickly and with nuance intact translate between languages, it becomes easier to use a preferred non-dominant language, which would make English less dominant. There's less incentive to spend so much time learning this oddly irregular foreign tongue if the skill is embedded in your phone.
[+] lucb1e|3 years ago|reply
> All the AIs will be trained on the available media - most of which is English.

Is it?

[+] ecshafer|3 years ago|reply
FWIW Chinese tech companies also have a lot of really impressive stuff, like WuDao 2.0. They just don't get the same amount of press.
[+] mutagen|3 years ago|reply
I'm wondering about this in the context of new programming languages. If people are using LLMs to learn a new language, will a new programming language be at a disadvantage until there's a critical mass of code, comparisons to existing languages, Rosetta Stone style examples, etc?
[+] vintermann|3 years ago|reply
> All the AIs will be trained on the available media - most of which is English.

Are you sure about that? Most of the media we see, sure, but there has been, and still is a lot of media being produced in other languages.

[+] galaxytachyon|3 years ago|reply
So what I got from this is that GPT was trained on a dataset that is biased toward English content. Is that right?

I think even humans have to spend extra energy to speak a language they were not born with, no matter how fluent they are in it. I don't know about natural multilinguals.

[+] terafo|3 years ago|reply
Nope, it's not about the dataset. It's just a bad tokenizer. Korean has a couple dozen symbols in its alphabet. Cyrillic-script languages have fewer than 50 symbols in total. Hiragana is 46 symbols. GPT-4 has 32k tokens IIRC. Including the most significant alphabets would take fewer than a thousand of them.
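The arithmetic behind this claim is easy to check against the main Unicode block ranges for each script (block boundaries below are from the Unicode code charts; main letters only):

```python
# Vocab slots needed to give every letter of a few non-Latin scripts
# its own whole-character token (main Unicode block ranges only).
blocks = {
    "Hiragana":           (0x3041, 0x3096),
    "Katakana":           (0x30A1, 0x30FA),
    "Cyrillic (Russian)": (0x0410, 0x044F),  # А..я, basic range
    "Hangul jamo":        (0x1100, 0x1112),  # initial consonants
}
total = 0
for name, (lo, hi) in blocks.items():
    n = hi - lo + 1
    total += n
    print(f"{name}: {n} characters")
print("total:", total)  # a few hundred slots out of a 32k+ vocab
```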
[+] tlrobinson|3 years ago|reply
I think yes, but more precisely the tokens were chosen to optimize training on a dataset that's biased to English content.

I am curious how the token set affects quality of responses, ignoring the factors related to token count mentioned in the post (cost, prompt expressivity, latency, etc)

Is it always better for the token set to be "native" to the majority of the training dataset and prompts/completions, or is it possible there's some "intermediate representation" (in compiler terms) that would be better?

[+] famouswaffles|3 years ago|reply
Data used to train the tokenizer is entirely separate from data training the LLM.

The tokenizer used for GPT-3 was old, inefficient, and targeted at tokenizing English. That's pretty much all there is to it. It's possible to train a tokenizer that is more efficient and more inclusive of other languages.

GPT-4's tokenizer is already far more efficient though still weighted to English.

You can test it here https://tiktokenizer.vercel.app/
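To make the separation concrete: a BPE tokenizer is trained by repeatedly merging the most frequent adjacent symbol pair in its own training corpus, entirely before and independently of LLM training. A toy sketch (not the real GPT trainer, and the corpus below is made up) shows why an English-heavy corpus yields English-friendly tokens:

```python
from collections import Counter

def bpe_train(corpus, num_merges):
    """Toy byte-pair-encoding trainer: repeatedly merge the most
    frequent adjacent symbol pair. A sketch, not the real GPT trainer."""
    words = [list(w) for w in corpus.split()]
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for w in words:
            for a, b in zip(w, w[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        new_words = []
        for w in words:
            out, i = [], 0
            while i < len(w):
                if i + 1 < len(w) and (w[i], w[i + 1]) == best:
                    out.append(w[i] + w[i + 1])
                    i += 2
                else:
                    out.append(w[i])
                    i += 1
            new_words.append(out)
        words = new_words
    return merges

# An English-heavy corpus learns English merges first; the rare
# French word never earns a merge of its own.
corpus = "the cat the dog the the the pizza " * 10 + "voudrais"
print(bpe_train(corpus, 5))
```

Because "the" dominates this corpus, the first merges learned are ('t', 'h') and ('th', 'e'); "voudrais" stays character-by-character, which is exactly why it would cost many more tokens at inference time.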

[+] kevingadd|3 years ago|reply
Should the cost really be 15x? Or even 5x? In this case, it's not even a question of whether the network is better at English, it's that the cost to communicate with it at all in other languages is higher. Once you pay that cost you now have to deal with the network potentially generating lower quality results for prompts in non-English languages too, which raises the actual cost of doing something with GPT beyond 15x since you probably will need more attempts.
[+] wolfium3|3 years ago|reply
You can use their online tool to see how it tokenizes words: https://platform.openai.com/tokenizer
[+] minimaxir|3 years ago|reply
It's worth noting that this is only for GPT-3. If you're using ChatGPT or GPT-4, both use a different tokenizer that's more robust and uses/generates about 10% fewer tokens. (It's unclear how well it performs for non-English languages.)

You can test it offline using tiktoken: https://github.com/openai/tiktoken

[+] karmoka|3 years ago|reply
"Je voudrais une pizza" is better translated as "I would like a pizza"; "I want a pizza" would be "Je veux une pizza".
[+] idleproc|3 years ago|reply
"J’aimerais..." is better translated to "I would like...", n'est-ce pas?
[+] bob1029|3 years ago|reply
If you think about this from a "language is computation" perspective, it starts to get even more interesting.

For example, what would the real-world performance of ChatGPT be if we had trained it predominantly on German or Korean text?

Is English actually the best language/structure for this system?

[+] biztos|3 years ago|reply
Maybe there’s a competitive advantage in training a new one on just German, say, and unleashing it on automotive engineering problems?
[+] wordpad25|3 years ago|reply
HUGE SALE! Save 93% OFF on GPT API by translating prompt into English first!!!
[+] rubywilde|3 years ago|reply
Actually, it is not true. Hilarious.

The author compares two different encoders: Facebook's NLLB and GPT-2's. Where did the title come from?

Another point is that OpenAI changed encoders for the chat models. Link: https://github.com/openai/openai-cookbook/blob/main/examples...

Now the encoder is less English-optimized in token usage and other languages are much more balanced. E.g. Ukrainian takes only twice as many tokens, where before it took 6 times as many.

[+] FrostKiwi|3 years ago|reply
So glad someone took the time to put up some data about this. Since day one, the subpar results for Asian languages have stuck out to me. It's especially true for LLaMA-derived models, where the output is just abysmal. My own pet theory is that bad tokenization is an important reason why they suck so much in the first place.

It's not just broken grammar; there's a surprising lack of creativity that English doesn't suffer from. ChatGPT in English -> DeepL, then fixing the auto-translation, gives vastly better results than prompting ChatGPT to respond in an Asian language.

[+] mgaunard|3 years ago|reply
So for Latin-script languages they tokenize per word, and somehow for Asian languages it's tokenizing per radical.

Of course you'd end up with a lot more tokens. Just tokenize by word regardless of language.

[+] k8si|3 years ago|reply
"word" isn't a useful concept in a lot of languages. Words are obvious in English because English is analytic: https://en.wikipedia.org/wiki/Analytic_language

But there are tons of languages (not just CJK languages) that use either compounding or combos of root + prefix/suffix/infix to express what would be multiple words in English, e.g. German "Schadenfreude". It's actually way more useful to tokenize this as separate parts, because e.g. "Freude" might be part of a lot of other "words" as well. So you can share that token across a lot of words, thereby keeping the vocab compact.
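A tiny sketch of the token-sharing point, using a greedy longest-match split over a hypothetical vocab (real BPE inference applies its learned merges instead, but the effect is the same):

```python
def greedy_tokenize(word, vocab):
    """Split a word into the longest vocab entries, left to right.
    Falls back to single characters when nothing matches."""
    tokens, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in vocab or j == i + 1:
                tokens.append(word[i:j])
                i = j
                break
    return tokens

# Hypothetical German subword vocab: "freude" is one shared token.
vocab = {"schaden", "freude", "vor"}
print(greedy_tokenize("schadenfreude", vocab))
print(greedy_tokenize("vorfreude", vocab))
```

Both compounds reuse the single "freude" token, which is exactly the vocab compactness being described.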

[+] crazygringo|3 years ago|reply
Words aren't an equivalent count between languages either. English uses a lot of helper words, some other languages use multiple suffixes. Chinese characters don't even make it clear where "word" boundaries are -- there are no spaces.
[+] jltsiren|3 years ago|reply
It's more like some big languages receive special treatment, while everything else is interpreted as a byte stream. In Finnish language, the tokens seem to be arbitrary substrings of average length 3-4, and they rarely correspond to any semantically or grammatically meaningful units.
[+] Imnimo|3 years ago|reply
Setting aside the specific choice of tokenizer for GPT models, I'm curious how much difference in performance is made by the features of the human language used to represent the training data. Like if you kept the exact same training corpus and could wave a magic wand and translate it into any language and could create a custom tokenization for each language, would some be more amenable than others to GPT-style language modeling?
[+] startupsfail|3 years ago|reply
I’m finding it amazing that the model comes localized and supports obscure languages and is available. Compare this to traditional software. Or even to web software. Does Google come localized to all of these languages, for example?

Yes, there is overhead from localization. So what, this overhead was always there for software.

[+] jinushaun|3 years ago|reply
The French example is strange and shows that the language model has an English bias.

  - “I want a pizza” = 4 tokens
  - “Je voudrais une pizza” = 7 tokens
Why is “want” only 1 token in English, but “voudrais” 4 tokens? Following the French example, would “wants” and “wanted” map to one or two tokens?
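One way to see it, with an entirely hypothetical English-weighted subword vocabulary (the real GPT vocab differs, but the mechanism is the same): frequent English strings became whole tokens during tokenizer training, so inflections reuse the stem token plus a short suffix token, while the rarer French form stays split into small fragments.

```python
def greedy_split(word, vocab):
    """Greedy longest-match split against a toy vocab (a simplification
    of BPE inference; falls back to single characters)."""
    out, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in vocab or j == i + 1:
                out.append(word[i:j])
                i = j
                break
    return out

# Hypothetical vocab with English stems/suffixes but no French merges.
vocab = {"want", "s", "ed", "vo", "ud", "ra", "is"}
for w in ["want", "wants", "wanted", "voudrais"]:
    print(w, "->", greedy_split(w, vocab))
```

Here "wants" and "wanted" each cost two tokens (stem + suffix), while "voudrais" falls apart into four fragments, matching the 4-token count from the post.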
[+] HDMI_Cable|3 years ago|reply
I think it’s because the article itself is a bit wrong: ‘voudrais’ in French is more analogous to ‘I would like’ in English than ‘want’. Specifically, the ‘v-‘ indicates that this means ‘to want’, ‘-oud-‘ means that it is in the conditional or future, while ‘-ais’ would indicate its first person conditional. This being said, it makes sense ‘voudrais’ is more tokens than ‘want’, because it encodes more information.
[+] seba_dos1|3 years ago|reply
tl;dr - because it operates on tokens, not words, and the set of tokens it uses is optimized for representing English text.
[+] shagie|3 years ago|reply
Do other languages have as nice a mapping to tokens?

For example, if you were to go from French, you'd have 33 characters to work with rather than 26 (accents and such). And you'd have chemisier and chemisière being two different genders of the same word, used in different contexts.

English tends to not have this difference.

Likewise, French has more verb conjugation forms than English does.

If you were to go to Japanese, you'd have the hiragana, katakana and kanji.

While my Anglocentrism may be showing, I'm not sure there is another language that tokenizes as well when it comes to novel character combinations.

    Make up a new word.  Use it in a sentence.  Give a definition for it.

    My new word is 'diflubble'. It is the feeling one gets when they are both excited and nervous in anticipation of an upcoming event. 

    For example, I felt diflubble on the morning of my graduation ceremony.
vs:

    Make up a new word in Japanese.  Use it in a sentence and give a translation for it.  Give a definition for it.

    My new Japanese word is "keigarou", which means "being full of energy".

    例えば、私は今日、keigarouな気持ちでいます。
    Translation: For example, I am feeling keigarou today.
The thing there is that you can't just make up new kanji. And it wouldn't be hiragana either.
[+] hadlock|3 years ago|reply
I would imagine they have far more english optimized compute instances running
[+] 29athrowaway|3 years ago|reply
It is not that tokenization is optimized for English, but rather the other way around perhaps.

Take "lámpara" or "pantalones" in Spanish, for example. English speakers were clever enough to shorten those words to "lamp" and "pants" respectively. And they have done this with many words.

Translate text into Spanish and you will see text gets longer and there is more meaning encoded into words.

"La mesa" refers to a female table, although tables are not lifeforms and have no sex.

To me some languages impose a communication tax. It is taboo because people conflate language and culture and such.

[+] PufPufPuf|3 years ago|reply
It's funny that you're calling English "effective" because it has shorter words, even though word length has nothing to do with tokenization effectiveness -- if a long word is frequent enough, it becomes a single token. That's the point of doing tokenization instead of feeding raw bytes into the model.

BTW, English might have shorter words than many languages, but the sentences get wordier. For example, English "die" is shorter than Czech "umřít", but the sentence "We are going to die." is much longer than "Umřeme." in Czech.

[+] Y_Y|3 years ago|reply
"la mesa" isn't a female table, it's just a table. If you want to specify that the table is female (in reality) then you might say "mesa hembra". The fact that "mesa" is _grammatically_ feminine is a red herring. It's a rule of the language that occasionally corresponds to nature, but that's in a very limited minority of cases. You can think of grammatical gender like an optional redundant bit (against, say, mishearing) when giving some information, but since there's no other way to talk about a table it doesn't give any more information than "the table" when written down.
[+] cool_dude85|3 years ago|reply
One wonders whether highly agglutinative languages, then, might have even better performance than English in the tokenizer since they can pack much more meaning into a single word.

The linked article shows one such language, Malayalam, costing 15.7 times more. Try again.

[+] kevingadd|3 years ago|reply
If you familiarize yourself with ideographic or ideographic-adjacent languages like Japanese or Chinese, you will probably notice that they are way more efficient than English. Yet those languages pay a tokenization tax too (thanks in no small part to the decisions of the largely Western Unicode committees to favor Western character sets; the UTF-8 encoding favors ASCII tremendously).
[+] aqme28|3 years ago|reply
Different languages have different levels of conciseness, of course, but I highly doubt that Spanish is anywhere close to 15x less concise than English.
[+] explaininjs|3 years ago|reply
Eh… “la mesa” is “the table”; English wins. Even in context, Spanish conjugation rules allow you to elide pronouns in many cases that would be confusing in English.

The reason Spanish might encode longer is that the tokenization scheme compacts tokens based on popularity in the training data, and most training data was English. No more, no less.

[+] k8si|3 years ago|reply
Communication rates are very similar across languages: https://www.science.org/doi/10.1126/sciadv.aaw2594

See also (great read): https://pubmed.ncbi.nlm.nih.gov/31006626/

wrt your Spanish example: grammatical gender adds information redundancy to make it easier to process spoken language (e.g. helps with reference resolution). This redundancy enables Spanish speakers to speak at a relatively fast rate without incurring perception errors. English has fewer words but a slower speech rate. It's an optimization problem.

The speech rate issue isn't as obvious if you're only looking at text, but I'd argue/speculate that lossless speech as a language evolutionary constraint has implications for learnability.

tl;dr there is no communication tax; languages are basically equivalent wrt information rate, they just solved the optimization problem of compactness vs speech rate differently