nickvincent | 3 years ago
It gives a generic answer that it's some proprietary combinations of "books, articles and websites". I'd guess Wikipedia is in there for sure (English and maybe other editions as well), something like "BookCorpus" (https://huggingface.co/datasets/bookcorpus), probably a large scrape of news articles up to 2021. And definitely a full scrape of pretty much the entire academic/scientific literature (just based on poking around). Overall, probably very similar to GPT-3 (which is also a bit mysterious still!)
The official post (https://openai.com/blog/chatgpt/) also describes that some pretty rich human feedback data was collected as well, for the reinforcement learning component. I think this is probably the real secret sauce for why it feels so qualitatively different from a lot of the LLMs that came before.
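The core of that human feedback step can be sketched as a pairwise preference loss: a reward model is trained so the response humans preferred scores higher than the one they rejected. This is a toy illustration of the idea (a Bradley-Terry-style loss), not anything from OpenAI's actual code; the function name is just for illustration:

```python
import math

def pairwise_loss(reward_chosen: float, reward_rejected: float) -> float:
    """-log(sigmoid(r_chosen - r_rejected)): small when the reward model
    already scores the human-preferred response well above the rejected one."""
    margin = reward_chosen - reward_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Preferred response scored higher -> low loss:
print(round(pairwise_loss(2.0, -1.0), 4))  # 0.0486
# Preferred response scored lower -> high loss, pushing the model to fix it:
print(round(pairwise_loss(-1.0, 2.0), 4))  # 3.0486
```

The trained reward model is then used as the objective for the reinforcement learning step (PPO, per the blog post), which is what shapes the conversational behavior on top of the base LM.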
alchemist1e9 | 3 years ago
My guess is that this is obscured for legal reasons: they have used a massive body of copyrighted data and hope to avoid controversy over the inputs by not talking about them.
I once saw a huge collection of links to curated input datasets for language models, but I haven't been able to find it in my notes/bookmarks, unfortunately.
alchemist1e9 | 3 years ago
https://en.wikipedia.org/wiki/Common_Crawl
I also have an odd hunch that ChatGPT might have used a Sci-Hub mirror as an input, for example.