top | item 40378363

Show HN: Safely batch byte-pair merges to speed up tokenization in Python 30X

1 point | alexandermorgan | 1 year ago | colab.research.google.com

4 comments

yorwba|1 year ago

For further optimization, custom index data structures à la https://en.wikipedia.org/wiki/Re-Pair would likely help.

Thanks for sharing, I wasn't aware of this. I'm having trouble seeing how it differs from the byte-pair encoding algorithm. More importantly, though, how can it run in linear time when it's recursive and the count of each pair has to be tallied again after every merge?
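
On the re-tallying part of that question, a small sketch may help. A merge only changes the counts of pairs that touch a merged occurrence, so the tally can be patched in place instead of being rebuilt. The names below (`count_pairs`, `merge_with_delta`, and the new token id 256) are made up for illustration and are not taken from the linked notebook or from any Re-Pair implementation. The sketch still scans the whole sequence on every merge, so it is not linear time by itself. As I understand it, Re-Pair gets there by also keeping linked occurrence lists and a frequency-ordered queue, so that each merge visits only the affected positions.

```python
from collections import Counter


def count_pairs(tokens):
    """Tally every adjacent pair from scratch (the naive approach)."""
    return Counter(zip(tokens, tokens[1:]))


def merge_with_delta(tokens, pair, new_id):
    """Merge `pair` left to right.

    Returns the merged sequence plus a Counter of count changes, so an
    existing tally can be patched instead of recomputed.
    """
    out, delta = [], Counter()
    i, n = 0, len(tokens)
    while i < n:
        if i + 1 < n and (tokens[i], tokens[i + 1]) == pair:
            delta[pair] -= 1  # the merged occurrence itself disappears
            if out:
                # The pair formed with the left neighbour changes.
                delta[(out[-1], tokens[i])] -= 1
                delta[(out[-1], new_id)] += 1
            if i + 2 < n:
                # The pair formed with the right neighbour changes.
                delta[(tokens[i + 1], tokens[i + 2])] -= 1
                delta[(new_id, tokens[i + 2])] += 1
            out.append(new_id)
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out, delta


tokens = list(b"abababcab")
counts = count_pairs(tokens)
merged, delta = merge_with_delta(tokens, (ord("a"), ord("b")), 256)
counts.update(delta)  # Counter.update adds counts, negative ones included

# The patched tally matches a full recount of the merged sequence.
patched = {p: c for p, c in counts.items() if c}
assert patched == dict(count_pairs(merged))
```

The final assertion checks the patched tally against a full recount, which is the property that makes repeated re-tallying unnecessary.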