There's a big concern in the long run: if ChatGPT seriously reduces the number of people who visit StackExchange domains, there will be no dataset for GPT 4 / ChatGPT 2. For example, what if a brand-new programming language gains popularity, or new libraries appear with very different patterns of use?

This "paradox of reuse" is a really big deal, IMO (blog post on the topic: https://nmvg.mataroa.blog/blog/the-paradox-of-reuse-language...)

It's actually a separate concern from the spam / content bloat described in the linked post, but they complement each other in creating amplified harm for StackExchange-like platforms: some fraction of users may stop visiting the site (because ChatGPT answered without links) and some other fraction of users will submit spam to try and earn points.

nickvincent|3 years ago

Also, based on all the public info about InstructGPT (the closest ChatGPT "family member"), all of StackExchange is definitely in the training data via OpenAI's "filtered Common Crawl", if it isn't also included as a special over-weighted training set (English Wikipedia, for instance, was over-weighted in GPT 3 training).