Are there any specific reasons for using BPE rather than Unigram in LLMs? I've been trying to understand the impact of the tokenization algorithm, and Unigram has been reported to be the better alternative (e.g., "Byte Pair Encoding is Suboptimal for Language Model Pretraining": https://arxiv.org/abs/2004.03720). My understanding is that Unigram training should also eliminate under-trained tokens, provided the tokenizer is trained on the same data as the LLM itself.
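For concreteness, this is the kind of comparison I mean; a minimal sketch using SentencePiece, which implements both trainers (`corpus.txt` and the vocab size here are placeholders, not anything specific):

```python
import sentencepiece as spm

# Train a BPE and a Unigram tokenizer on the same corpus
# (ideally the same data the LLM itself will be pretrained on).
for model_type in ("bpe", "unigram"):
    spm.SentencePieceTrainer.train(
        input="corpus.txt",        # placeholder: your pretraining text
        model_prefix=model_type,   # writes bpe.model / unigram.model
        vocab_size=8000,           # placeholder vocab size
        model_type=model_type,
    )

bpe = spm.SentencePieceProcessor(model_file="bpe.model")
uni = spm.SentencePieceProcessor(model_file="unigram.model")

# Compare how the two algorithms segment the same sentence.
text = "Byte Pair Encoding is suboptimal for language model pretraining."
print("BPE:    ", bpe.encode(text, out_type=str))
print("Unigram:", uni.encode(text, out_type=str))
```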