top | item 42836175

(no title)

VierScar | 1 year ago

Wouldn't it be easier to cut off at pre-2020-ish, and ask it to create the transformer architecture of GPT? 1900 is so long ago that I doubt most documents are good quality, if they've been digitised at all. Most likely they're just low-quality scanned images of inconsistent, half-illegible typewritten documents, transcribed with OCR at best.

kccqzy|1 year ago

The problem I see with any date after the internet became popular is that you just can't be sure of the right date. A lot of traditional web forums now have backdated posts, clearly written by an LLM, with implausible dates: https://hallofdreams.org/posts/physicsforums/

throwup238|1 year ago

You can use CommonCrawl - which has massive datasets going back to 2008 - and the Internet Archive.

cellis|1 year ago

Also, there's so little training data from that era. Like, exponentially more data was created after, say, <year when most records became digitized = 1970>