
napier | 2 years ago

I’d like to see a model where the effluent of the internet has been intelligently filtered out of the pretraining data by LLM and human curation, and where much more effort has gone into including digitised archival sources, entire books, and high-quality media transcripts. I imagine it would yield far better baseline output quality, with far less of the current “requirement” for (over)correction through ultimately disastrous RLHF masking.
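Roughly what I have in mind, as a toy sketch. The scorer below is a throwaway heuristic standing in for an actual LLM judge, and the threshold/band numbers are made up; the point is the keep/drop/send-to-a-human split:

    def score_quality(text: str) -> float:
        """Crude heuristic stand-in for an LLM quality judge; returns a score in [0, 1]."""
        words = text.split()
        if not words:
            return 0.0
        alpha_ratio = sum(c.isalpha() or c.isspace() for c in text) / len(text)
        length_score = min(sum(len(w) for w in words) / len(words) / 6, 1.0)
        unique_ratio = len(set(words)) / len(words)  # penalise spammy repetition
        return 0.3 * alpha_ratio + 0.2 * length_score + 0.5 * unique_ratio

    def filter_corpus(docs, threshold=0.6, band=0.1):
        """Keep high scorers, drop low ones, queue the borderline for human review."""
        kept, review, dropped = [], [], []
        for doc in docs:
            s = score_quality(doc)
            if s >= threshold + band:
                kept.append(doc)
            elif s <= threshold - band:
                dropped.append(doc)
            else:
                review.append(doc)  # the human-curation step
        return kept, review, dropped

    docs = [
        "The mitochondrion is the site of oxidative phosphorylation in eukaryotes.",
        "buy buy buy buy buy buy buy buy buy buy",
    ]
    kept, review, dropped = filter_corpus(docs)
    print(f"kept={len(kept)} review={len(review)} dropped={len(dropped)}")

The interesting part is the middle bucket: that’s where the human curation earns its keep.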

jiggawatts | 2 years ago

I'd love to play with a version of GPT-4 fine-tuned on every science textbook written in the last few decades, every published science paper (not just the preprints on arXiv), and everything put out by every large research institute. Think NASA, CERN, etc.

Or one tuned on every fiction novel ever written, along with every screenplay.
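For the tuning data itself, the prep step might look like this: chunk the raw text and write it out as OpenAI-style chat-format JSONL. The corpus string, prompt wording, and chunk size are placeholders, just to show the shape of the file:

    import json

    def chunk(text, max_chars=2000):
        """Split raw text into roughly paragraph-aligned chunks."""
        out, buf, size = [], [], 0
        for p in text.split("\n\n"):
            if buf and size + len(p) > max_chars:
                out.append("\n\n".join(buf))
                buf, size = [], 0
            buf.append(p)
            size += len(p)
        if buf:
            out.append("\n\n".join(buf))
        return out

    # Stand-in for a scanned textbook/paper corpus.
    raw = "Chapter 1. Thermodynamics.\n\nThe first law states that energy is conserved."

    with open("train.jsonl", "w") as f:
        for passage in chunk(raw, max_chars=500):
            # Continued-pretraining style: the assistant turn is just the passage.
            record = {"messages": [
                {"role": "user", "content": "Recite the next passage."},
                {"role": "assistant", "content": passage},
            ]}
            f.write(json.dumps(record) + "\n")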

benxh | 2 years ago

So a model fine-tuned on libgen?

napier | 2 years ago

I would gladly pay triple digits a month for exactly that.