
float-trip | 2 years ago

I tried adding special tokens for a reddit-style dataset once. The format was: `<|post_author|>username<|post_title|>title here...`

The resulting model was so much worse than just formatting everything plaintext. This was with MPT-30B, 15 special tokens, 300M training tokens, and a full finetune.

I may have made a mistake, but I haven't seen any open source finetunes successfully add a large number of tokens yet either.

Tostino | 2 years ago

Try doing the same thing in your dataset, but don't actually add them as "special tokens"; just let them be multiple tokens.

Adding new tokens needs a ton of data to train what the token means. Reusing existing tokens will allow you to easily teach that a sequence of tokens now has a new meaning after fine-tuning.
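A toy sketch of the difference (the vocab, ids, and embedding sizes here are all made up): a brand-new special token gets one freshly initialized embedding row that the model must learn from scratch, while a plaintext marker like `[Author]` is spelled with several already-trained subword embeddings.

```python
import random

# hypothetical pretrained subword vocab and embeddings
vocab = {"[": 10, "Author": 11, "]": 12, " ": 13}
embeddings = {tid: [random.gauss(0, 1) for _ in range(4)] for tid in vocab.values()}

# Option A: add "<|post_author|>" as a single new special token.
# Its embedding row is random-initialized, so its meaning must be
# learned entirely during fine-tuning.
NEW_ID = 500
embeddings[NEW_ID] = [random.gauss(0, 0.02) for _ in range(4)]

# Option B: spell the marker with existing tokens.
def encode_marker(text: str) -> list[int]:
    # naive greedy longest-match split against the toy vocab
    ids, i = [], 0
    while i < len(text):
        for piece in sorted(vocab, key=len, reverse=True):
            if text.startswith(piece, i):
                ids.append(vocab[piece])
                i += len(piece)
                break
        else:
            raise ValueError(f"no vocab piece matches at position {i}")
    return ids

print(encode_marker("[Author]"))  # three already-trained ids: [10, 11, 12]
```

Every embedding the marker touches in option B already carries meaning from pretraining, which is the intuition behind needing far less fine-tuning data.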

float-trip | 2 years ago

That's what I ended up doing (`[Author] username [Title] post title...`)

> Adding new tokens needs a ton of data to train what the token means.

But how much? 300M tokens is fine for a simple version of ChatML with ~4 tokens. Not for 15, at least in my case. How does this relationship scale?

Just trying to offer one datapoint for what doesn't work, with the hedge that I might have just had a bug.
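The plaintext scheme above can be sketched as a tiny formatting helper. Only `[Author]` and `[Title]` appear in the comment; the `[Body]` field name is a hypothetical stand-in for the elided rest of the format.

```python
def format_post(author: str, title: str, body: str) -> str:
    # Markers are ordinary text, so they tokenize into existing
    # subword pieces instead of new vocab entries.
    # "[Body]" is an assumed field name, not from the thread.
    return f"[Author] {author} [Title] {title} [Body] {body}"

print(format_post("some_user", "Example title", "Example body text."))
```

Since the markers are plain strings, the same pretrained tokenizer and embedding matrix are used unchanged.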

tayo42 | 2 years ago

I don't mean add special tokens, but make the vocab only the set of possible cards. Each card is a token.

A simple input might be `<cards you hold> 1 14 56</end><cards to pick> 5 64 2</end>` -> the predicted token is the draft pick.

Then train a transformer-based network from scratch.
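A minimal sketch of building that vocabulary (the pool size and structural tokens are assumptions): the vocab is just one token per card id plus a few structural markers, and each training example is the sequence shown above, with the picked card's token as the prediction target.

```python
NUM_CARDS = 300  # assumed size of the card pool

# structural tokens first, then one token per card id
specials = ["<cards you hold>", "<cards to pick>", "</end>"]
token_to_id = {tok: i for i, tok in enumerate(specials)}
for card in range(NUM_CARDS):
    token_to_id[str(card)] = len(token_to_id)

def encode(held: list[int], offered: list[int]) -> list[int]:
    # one draft decision as a flat token-id sequence
    toks = ["<cards you hold>", *map(str, held), "</end>",
            "<cards to pick>", *map(str, offered), "</end>"]
    return [token_to_id[t] for t in toks]

seq = encode([1, 14, 56], [5, 64, 2])
# a from-scratch transformer would be trained to predict the picked
# card's token id as the next token after this sequence
print(len(token_to_id), seq)
```

With a vocab this small, the embedding table is tiny and every token is seen often, which is the opposite regime from bolting rare special tokens onto a pretrained LLM's vocabulary.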