Everything archived pre-2021 is still untainted. All major social media, Q&A sites, code repos, and archive.org are timestamped. It taints future collection of training data, but not existing collections.
What's the plan then, to coast on pre-2021 data forever? How much utility would today's LLMs have if they were trained on fossilized archives of the internet from 10, 15, 20 years ago?
jsheard|2 years ago