nickvincent | 3 years ago
It gives a generic answer that it's some proprietary combinations of "books, articles and websites". I'd guess Wikipedia is in there for sure (English and maybe other editions as well), something like "BookCorpus" (https://huggingface.co/datasets/bookcorpus), probably a large scrape of news articles up to 2021. And definitely a full scrape of pretty much the entire academic/scientific literature (just based on poking around). Overall, probably very similar to GPT-3 (which is also a bit mysterious still!)
The official post (https://openai.com/blog/chatgpt/) also describes that some pretty rich human feedback data was collected as well, for the reinforcement learning component. I think this is probably the real secret sauce for why it feels so qualitatively different from a lot of the LLMs that came before.
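The core of that human feedback step can be sketched as a pairwise preference loss: a reward model is trained so the response humans preferred scores higher than the one they rejected. This is a toy illustration of the idea (a Bradley-Terry-style loss), not anything from OpenAI's actual code; the function name is just for illustration:

```python
import math

def pairwise_loss(reward_chosen: float, reward_rejected: float) -> float:
    """-log(sigmoid(r_chosen - r_rejected)): small when the reward model
    already scores the human-preferred response well above the rejected one."""
    margin = reward_chosen - reward_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Preferred response scored higher -> low loss:
print(round(pairwise_loss(2.0, -1.0), 4))  # 0.0486
# Preferred response scored lower -> high loss, pushing the model to fix it:
print(round(pairwise_loss(-1.0, 2.0), 4))  # 3.0486
```

The trained reward model is then used as the objective for the reinforcement learning step (PPO, per the blog post), which is what shapes the conversational behavior on top of the base LM.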
alchemist1e9 | 3 years ago
My guess is that this is obscured for legal reasons: they have used a massive body of copyrighted data and hope to avoid controversy over the inputs by not talking about them.
I once saw a huge collection of links to curated input datasets for language models, but I haven't been able to find it in my notes/bookmarks, unfortunately.
alchemist1e9 | 3 years ago
https://en.wikipedia.org/wiki/Common_Crawl
I also have an odd hunch that ChatGPT might have used a Sci-Hub mirror as an input, for example.