mnks's comments
mnks | 3 years ago | on: Ask HN: Programs that saved you 100 hours? (2022 edition)
mnks | 3 years ago | on: Tiktoken: OpenAI’s Tokenizer
mnks | 3 years ago | on: Shannon Entropy Imposes Fundamental Limits on Communication
[1]: https://blog.kardas.org/post/entropy/ (Average Code Length section)
mnks | 4 years ago | on: Papers with Code
mnks | 6 years ago | on: Draft of the Fast.ai Book
Your "powychodziłybyście" example could be translated as "you (feminine, plural) would have been going out". With word tokenization, you get (ignoring the commas and brackets) 8 tokens in English and one token in Polish. Now you can have three persons, two genders, two numbers, an imperfective or perfective verb, etc., resulting in combinatorial growth of word tokens in Polish. If you have all the word forms for "go out" and you want to add "go in", in English you would add a single token "in", while in Polish you would add all the tokens with "-wy-" replaced by "-w-". As a result, in Polish you end up with a much bigger vocabulary. Additionally, you need a bigger training corpus, as the tokens cannot be learned independently. For example, if you know the meaning of "he ate" and "she wrote", you should be able to guess the meaning of "he wrote", as you've already seen all of the tokens. In Polish it's "Zjadł", "Napisała" and "Napisał": all three word tokens are different.
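A quick sketch of the comparison, using plain word tokenization (split on non-word characters); the 24-form count at the end just multiplies the grammatical dimensions mentioned above, so it's illustrative rather than an exact paradigm count:

```python
import re
from itertools import product

# Word-level tokenization: drop punctuation/brackets, split on word boundaries.
english = "you (feminine, plural) would have been going out"
en_tokens = re.findall(r"\w+", english)
pl_tokens = "powychodziłybyście".split()
print(len(en_tokens), len(pl_tokens))  # 8 vs 1

# Combinatorial growth: 3 persons x 2 genders x 2 numbers x 2 aspects,
# and each combination is a distinct Polish word form, i.e. a distinct
# word-level token for a single verb meaning.
forms = list(product([1, 2, 3], ["f", "m"], ["sg", "pl"], ["impf", "pf"]))
print(len(forms))  # 24
```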
Using subword tokenization instead of word-level tokenization is somewhat like using a normalized database instead of an unnormalized one. It's not that one form is more complex than the other, as they're equivalent. After all, would written English be much more complex if we removed all whitespace? :)
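To make the analogy concrete, here's a minimal sketch of subword segmentation using greedy longest-match over a small hypothetical morpheme vocabulary. The vocabulary and the resulting splits are illustrative only (real subword vocabularies, e.g. BPE, are learned from data, and this isn't a proper morphological analysis of Polish):

```python
# Hypothetical morpheme inventory covering the examples above.
vocab = {"po", "wy", "w", "chodzi", "ły", "ła", "ł", "by", "ście",
         "na", "pisa", "zjad"}

def greedy_segment(word, vocab):
    """Greedy left-to-right longest-match segmentation (a crude
    stand-in for a learned subword tokenizer such as BPE)."""
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):  # try longest substring first
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            pieces.append(word[i])  # fall back to a single character
            i += 1
    return pieces

print(greedy_segment("powychodziłybyście", vocab))
# -> ['po', 'wy', 'chodzi', 'ły', 'by', 'ście']

# The "Zjadł" / "Napisała" / "Napisał" forms now share subword tokens,
# so the model can generalize across them instead of treating each
# surface form as an unrelated word token:
for w in ["zjadł", "napisała", "napisał"]:
    print(w, "->", greedy_segment(w, vocab))
```

With subwords, adding "go in" alongside "go out" really does come down to one extra piece ("w" instead of "wy"), which is the normalized-database flavor of the representation.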