top | item 35460648

dschnurr | 2 years ago

Hi folks – I work at OpenAI and helped build this page, awesome to see it on here! Heads up that it's a bit out of date, as GPT-4 has a different tokenizer than GPT-3. I'd recommend checking out tiktoken (https://github.com/openai/tiktoken) or this other excellent app that a community member made (https://tiktokenizer.vercel.app).
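For anyone who wants exact counts locally, tiktoken's Python API is small. A minimal sketch, assuming tiktoken is installed (`pip install tiktoken`); `cl100k_base` is the encoding tiktoken maps to gpt-3.5-turbo and GPT-4:

```python
import tiktoken

# Encoding used by gpt-3.5-turbo and GPT-4 (per tiktoken's model mapping)
enc = tiktoken.get_encoding("cl100k_base")

text = "Hello, world!"
tokens = enc.encode(text)

print(tokens)       # list of integer token ids
print(len(tokens))  # the token count you'd be billed for on this text
print(enc.decode(tokens) == text)  # encoding round-trips losslessly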

lowefk | 2 years ago

I wasn't aware that GPT-3 and GPT-4 use different tokenizers. I had read https://github.com/openai/openai-cookbook/blob/main/examples... and misinterpreted "ChatGPT models like gpt-3.5-turbo and gpt-4 use tokens in the same way as older completions models, ..." as meaning that GPT-3 and GPT-4 use the same tokenizer except for the im_ tokens. Now I can see many improvements, including the encoding of whitespace and digits.

egorfine | 2 years ago

Hey, it seems that UTF-8 support is broken on the page.

A test phrase could be something like "Жизнь прекрасна и удивительна" ("Life is beautiful and amazing" in Russian).

My assumption is that it's the page's implementation that is broken, not the actual tokenizer. The reason: Russian works perfectly in GPT-3, which I guess wouldn't be the case if the tokenization were as presented on the page.

dqbd | 2 years ago

Author here, and you are correct! The issue is that a single user-perceived character can span multiple tokens. This should be fixed now.
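The underlying reason: GPT-style BPE tokenizers operate on UTF-8 bytes, not characters, so a token boundary can fall in the middle of a multi-byte character. Decoding each token's bytes independently then yields invalid UTF-8 (the bug the page had). A stdlib-only sketch of the problem and the usual fix:

```python
import codecs

# A single Cyrillic letter occupies two UTF-8 bytes, so a byte-level
# tokenizer can split it across two tokens.
ch = "Ж"
raw = ch.encode("utf-8")
print(len(ch), "character,", len(raw), "bytes")

# Decoding either half on its own fails:
first, second = raw[:1], raw[1:]
try:
    first.decode("utf-8")
except UnicodeDecodeError:
    print("partial bytes are not valid UTF-8")

# The fix: decode incrementally, buffering incomplete sequences until
# the remaining bytes of the character arrive.
dec = codecs.getincrementaldecoder("utf-8")()
out = dec.decode(first) + dec.decode(second)
print(out)
```

The incremental decoder emits a character only once all of its bytes have been seen, which is exactly the behavior a token-by-token display needs.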

lemming | 2 years ago

Are there plans to release tokenisers for other platforms? I'm accessing the OpenAI API from Clojure, and it would be really nice to have a JVM version so I can estimate token use before sending.
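Until an official JVM port exists, one stopgap for pre-flight estimates is the commonly cited rule of thumb of roughly four characters per token for English text. This is an approximation only, not the real BPE count; the helper name below is hypothetical and the arithmetic ports directly to Clojure or Java. Sketched in Python for consistency with the rest of the thread:

```python
import math

def estimate_tokens(text: str) -> int:
    """Very rough token estimate using the ~4 characters/token rule of
    thumb for English text. Not the real BPE count; use only to decide
    whether a request is plausibly within budget before sending it."""
    return max(1, math.ceil(len(text) / 4))

print(estimate_tokens("The quick brown fox jumps over the lazy dog"))
```

Real counts diverge badly for code, non-Latin scripts, and digit-heavy text, so treat this as a sanity check rather than a billing predictor.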

teruakohatu | 2 years ago

That is very helpful, thank you. I had not realised the latest models now tokenize numbers in three-digit groups. Can you give any insight into why three digits?

resters | 2 years ago

Was the purpose of the page and post to generate comments that can be used as training data?