top | item 36244558


countmora | 2 years ago

> unfortunately the tokenizer was trained on this subreddit

Do you have a source for that or was it just an assumption?


mike_hearn | 2 years ago

It's pretty much guaranteed. Where else on the internet would this sequence of characters appear so frequently that it gets selected as one of the tokenizer's ~50,000 vocabulary entries?
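The selection mechanism is worth spelling out. Byte-pair-encoding tokenizers repeatedly merge the most frequent adjacent symbol pair in the training corpus, so any string that recurs often enough (a username, say) ends up as a single vocabulary entry. A toy sketch, with a hypothetical corpus where one word dominates:

```python
# Minimal byte-pair-encoding training sketch. The corpus below is
# hypothetical; it just shows that a very frequent string collapses
# into one token after a few merges.
from collections import Counter

def train_bpe(words, num_merges):
    # Represent each word as a tuple of single-character symbols.
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the merge: fuse every occurrence of the best pair.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges, vocab

# 50 occurrences of "karp" versus a handful of other words:
corpus = ["karp"] * 50 + ["cat", "car"]
merges, vocab = train_bpe(corpus, 3)
# After three merges, "karp" is a single symbol in the vocabulary.
```

Real tokenizers like GPT-2's work on bytes with a much larger corpus and ~50k merges, but the principle is the same: raw frequency alone decides what becomes a token.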

It's also widely known that Reddit is frequently used to train LLMs. It's an unusually clean source of conversational text because you can slice threads (i.e. pick a root comment, then pick a child, then a child of the child, etc., and concatenate the results) and you'll get a coherent conversation. There are relatively few places on the internet where that is true. For example, most phpBB forums conflate many different conversations into a single thread, with ad-hoc quoting used to disambiguate which post is replying to which. That makes it much harder to generate sample conversations from.
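The slicing idea above can be sketched in a few lines: walk one path down the comment tree and concatenate the turns. The `Comment` structure and its field names here are hypothetical, not any real Reddit API schema.

```python
# Sketch: turn one root-to-leaf path of a comment tree into a linear
# conversation. Field names are invented for illustration.
from dataclasses import dataclass, field

@dataclass
class Comment:
    author: str
    body: str
    children: list = field(default_factory=list)

def slice_thread(root):
    """Follow a single path down the tree (here: always the first reply)."""
    turns = []
    node = root
    while node is not None:
        turns.append(f"{node.author}: {node.body}")
        node = node.children[0] if node.children else None
    return "\n".join(turns)

thread = Comment("alice", "Why does X happen?", [
    Comment("bob", "Because of Y.", [
        Comment("alice", "That makes sense, thanks."),
    ]),
])
print(slice_thread(thread))
```

Because every child is a direct reply to its parent, any root-to-leaf path is a coherent two-or-more-party dialogue, which is exactly what a phpBB-style flat thread does not give you.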

dontupvoteme | 2 years ago

> There are relatively few places on the internet where that is true

Imageboards.

DailyMail.

Slashdot.

Even a SomethingAwful dump would have been superior.

klooney | 2 years ago

See the old SolidGoldMagikarp drama: it's happened before.