gwillen | 1 year ago

> Aren’t coding copilots based on tokenizing programming language keywords and syntax?

No, they use the same tokenization as everyone else. There was one major change from early to modern LLM tokenization, made (as far as I can tell) for efficient tokenization of code: early tokenizers always made each space its own token (unless it was attached to an adjacent word). Modern tokenizers can group a run of many spaces into a single token.
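
To make the difference concrete, here is a toy sketch in Python. It is not any real model's tokenizer: the two regexes and the names `split_early` and `split_modern` are made up for illustration, and real tokenizers apply learned merges on top of a split like this. It only shows how indentation inflates the token count when every space stands alone.

```python
import re

# Hypothetical "early" split: one space may attach to the word after it;
# every other whitespace character becomes its own piece.
EARLY = re.compile(r" ?[^\s]+|\s")

# Hypothetical "modern" split: a run of spaces is kept as a single piece.
MODERN = re.compile(r" ?[^\s]+| +|\s")


def split_early(text: str) -> list[str]:
    return EARLY.findall(text)


def split_modern(text: str) -> list[str]:
    return MODERN.findall(text)


line = " " * 8 + "return x"  # a line of code indented by eight spaces

print(len(split_early(line)))   # 9: seven lone spaces, " return", " x"
print(len(split_modern(line)))  # 3: the indent, "return", " x"
```

On deeply indented code the gap grows with every level of nesting, which is presumably why grouping spaces mattered for code in particular.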
