mnks's comments
mnks | 3 years ago | on: Ask HN: Programs that saved you 100 hours? (2022 edition)
mnks | 3 years ago | on: Tiktoken: OpenAI’s Tokenizer
mnks | 3 years ago | on: Shannon Entropy Imposes Fundamental Limits on Communication
[1]: https://blog.kardas.org/post/entropy/ (Average Code Length section)
mnks | 4 years ago | on: Papers with Code
mnks | 6 years ago | on: Draft of the Fast.ai Book
Your "powychodziłybyście" example could be translated as "you (feminine, plural) would have been going out". With word tokenization, you get (ignoring the commas and brackets) 8 tokens in English and one token in Polish. Now you can have three persons, two genders, two numbers, an imperfective or perfective verb, etc., resulting in combinatorial growth of word tokens in Polish. If you have all the word forms for "go out" and you want to add "go in", in English you would add a single token "in", while in Polish you would add all the tokens with "-wy-" replaced by "-w-". As a result, in Polish you end up with a much bigger vocabulary. Additionally, you need a bigger training corpus, as the tokens cannot be learned independently. For example, if you know the meaning of "he ate" and "she wrote", you should be able to guess the meaning of "he wrote", as you've already seen all of the tokens. In Polish it's "Zjadł", "Napisała" and "Napisał": all three word tokens are different.
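A quick sketch of the comparison, using plain word tokenization (split on non-word characters); the 24-form count at the end just multiplies the grammatical dimensions mentioned above, so it's illustrative rather than an exact paradigm count:

```python
import re
from itertools import product

# Word-level tokenization: drop punctuation/brackets, split on word boundaries.
english = "you (feminine, plural) would have been going out"
en_tokens = re.findall(r"\w+", english)
pl_tokens = "powychodziłybyście".split()
print(len(en_tokens), len(pl_tokens))  # 8 vs 1

# Combinatorial growth: 3 persons x 2 genders x 2 numbers x 2 aspects,
# and each combination is a distinct Polish word form, i.e. a distinct
# word-level token for a single verb meaning.
forms = list(product([1, 2, 3], ["f", "m"], ["sg", "pl"], ["impf", "pf"]))
print(len(forms))  # 24
```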
Using subword tokenization instead of word-level tokenization is somewhat like using a normalized database instead of an unnormalized one. It's not that one form is more complex than the other, as they're equivalent. After all, would written English be much more complex if we removed all whitespace? :)
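To make the analogy concrete, here's a minimal sketch of subword segmentation using greedy longest-match over a small hypothetical morpheme vocabulary. The vocabulary and the resulting splits are illustrative only (real subword vocabularies, e.g. BPE, are learned from data, and this isn't a proper morphological analysis of Polish):

```python
# Hypothetical morpheme inventory covering the examples above.
vocab = {"po", "wy", "w", "chodzi", "ły", "ła", "ł", "by", "ście",
         "na", "pisa", "zjad"}

def greedy_segment(word, vocab):
    """Greedy left-to-right longest-match segmentation (a crude
    stand-in for a learned subword tokenizer such as BPE)."""
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):  # try longest substring first
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            pieces.append(word[i])  # fall back to a single character
            i += 1
    return pieces

print(greedy_segment("powychodziłybyście", vocab))
# -> ['po', 'wy', 'chodzi', 'ły', 'by', 'ście']

# The "Zjadł" / "Napisała" / "Napisał" forms now share subword tokens,
# so the model can generalize across them instead of treating each
# surface form as an unrelated word token:
for w in ["zjadł", "napisała", "napisał"]:
    print(w, "->", greedy_segment(w, vocab))
```

With subwords, adding "go in" alongside "go out" really does come down to one extra piece ("w" instead of "wy"), which is the normalized-database flavor of the representation.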