top | item 36244558


countmora | 2 years ago

> unfortunately the tokenizer was trained on this subreddit

Do you have a source for that or was it just an assumption?


mike_hearn | 2 years ago

It's pretty much guaranteed. Where else on the internet would this sequence of characters appear so frequently that it gets selected as one of the tokenizer's ~50,000 vocabulary entries?
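The selection mechanism is worth spelling out. Byte-pair-encoding tokenizers repeatedly merge the most frequent adjacent symbol pair in the training corpus, so any string that recurs often enough (a username, say) ends up as a single vocabulary entry. A toy sketch, with a hypothetical corpus where one word dominates:

```python
# Minimal byte-pair-encoding training sketch. The corpus below is
# hypothetical; it just shows that a very frequent string collapses
# into one token after a few merges.
from collections import Counter

def train_bpe(words, num_merges):
    # Represent each word as a tuple of single-character symbols.
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the merge: fuse every occurrence of the best pair.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges, vocab

# 50 occurrences of "karp" versus a handful of other words:
corpus = ["karp"] * 50 + ["cat", "car"]
merges, vocab = train_bpe(corpus, 3)
# After three merges, "karp" is a single symbol in the vocabulary.
```

Real tokenizers like GPT-2's work on bytes with a much larger corpus and ~50k merges, but the principle is the same: raw frequency alone decides what becomes a token.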

It's also widely known that Reddit is frequently used to train LLMs. It's an unusually clean source of conversational text because you can slice threads (i.e. pick a root comment, then pick a child, then a child of the child, etc., and concatenate the results) and you'll get a coherent conversation. There are relatively few places on the internet where that is true. For example, most phpBB forums conflate many different conversations into a single thread, with ad-hoc quoting used to disambiguate which post is replying to which. That makes it much harder to generate sample conversations from.
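The slicing idea above can be sketched in a few lines: walk one path down the comment tree and concatenate the turns. The `Comment` structure and its field names here are hypothetical, not any real Reddit API schema.

```python
# Sketch: turn one root-to-leaf path of a comment tree into a linear
# conversation. Field names are invented for illustration.
from dataclasses import dataclass, field

@dataclass
class Comment:
    author: str
    body: str
    children: list = field(default_factory=list)

def slice_thread(root):
    """Follow a single path down the tree (here: always the first reply)."""
    turns = []
    node = root
    while node is not None:
        turns.append(f"{node.author}: {node.body}")
        node = node.children[0] if node.children else None
    return "\n".join(turns)

thread = Comment("alice", "Why does X happen?", [
    Comment("bob", "Because of Y.", [
        Comment("alice", "That makes sense, thanks."),
    ]),
])
print(slice_thread(thread))
```

Because every child is a direct reply to its parent, any root-to-leaf path is a coherent two-or-more-party dialogue, which is exactly what a phpBB-style flat thread does not give you.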

dontupvoteme | 2 years ago

> There are relatively few places on the internet where that is true

Imageboards.

DailyMail.

Slashdot.

Even a SomethingAwful dump would have been superior.

klooney | 2 years ago

See the old SolidGoldMagikarp drama: it's happened before.