Everything archived pre-2021 is still untainted. All major social media, Q&A sites, code repos, and archive.org are timestamped. It taints future collection of training data, but not existing collections.
What's the plan then, to coast on pre-2021 data forever? How much utility would today's LLMs have if they were trained on fossilized archives of the internet from 10, 15, 20 years ago?
jsheard|2 years ago