Thanks for sharing, I wasn't aware of this. I'm having trouble seeing how it differs from the byte-pair encoding algorithm. More importantly though, how can it be linear time when it's recursive and you have to tally the counts of each pair again after each merge?
yorwba|1 year ago
The algorithm is recursive in the sense that merged tokens can participate in further merges just as in byte-pair encoding, but it involves a whole lot of pointer-chasing, so the core is inherently quite iterative. Those pointers allow skipping over all tokens which are unaffected by a merge, so updating the counts is very cheap. You can imagine each initial token being equipped with a fixed compute budget, and merging two tokens uses up the compute budget of one token, but the compute budget for the other token remains for the merge result to use. So the overall time is bounded by (compute budget per token) x (number of tokens).
alexandermorgan|1 year ago