kacperlukawski | 1 year ago

Are there any specific reasons for using BPE rather than Unigram in LLMs? I've been trying to understand the impact of the tokenization algorithm, and Unigram has been reported to be the better alternative (e.g., "Byte Pair Encoding is Suboptimal for Language Model Pretraining": https://arxiv.org/abs/2004.03720). My understanding is that Unigram training should also eliminate under-trained tokens, provided it is trained on the same data as the LLM itself.
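For concreteness, here is a minimal sketch using the Hugging Face `tokenizers` library that trains both algorithms on the same text and compares their segmentations; the corpus, vocab size, and sample sentence are illustrative placeholders, not a claim about how any particular LLM was trained:

    # Train BPE and Unigram tokenizers on the same tiny corpus and
    # compare how each segments the same input string.
    from tokenizers import Tokenizer
    from tokenizers.models import BPE, Unigram
    from tokenizers.pre_tokenizers import Whitespace
    from tokenizers.trainers import BpeTrainer, UnigramTrainer

    # Placeholder corpus; in practice this would be the LLM's pretraining data.
    corpus = [
        "tokenization splits text into subword units",
        "byte pair encoding merges frequent symbol pairs",
        "unigram prunes a large seed vocabulary by corpus likelihood",
    ]

    # BPE builds its vocabulary bottom-up by merging frequent pairs.
    bpe = Tokenizer(BPE(unk_token="[UNK]"))
    bpe.pre_tokenizer = Whitespace()
    bpe.train_from_iterator(
        corpus, BpeTrainer(vocab_size=120, special_tokens=["[UNK]"])
    )

    # Unigram starts from a large candidate vocabulary and prunes the
    # tokens that contribute least to the corpus likelihood.
    uni = Tokenizer(Unigram())
    uni.pre_tokenizer = Whitespace()
    uni.train_from_iterator(
        corpus,
        UnigramTrainer(vocab_size=120, unk_token="[UNK]", special_tokens=["[UNK]"]),
    )

    text = "subword tokenization"
    print("BPE:    ", bpe.encode(text).tokens)
    print("Unigram:", uni.encode(text).tokens)

Because Unigram scores whole segmentations by likelihood rather than replaying a fixed merge list, the two models can split the same string quite differently even when trained on identical data, which is exactly the effect the linked paper examines.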

