vvrm | 3 years ago
> rough steps:
> 1. collect a very large dataset, see: https://www.lesswrong.com/posts/6Fpvch8RR29qLEWNH/chinchilla... . scrape, de-duplicate, clean, wrangle. this is a lot of work regardless of $.
The Pile seemed quite clean and manageable to me (I was able to preprocess it in ~8 hours for a simple task on consumer-grade hardware). Is the Pile clean and rich enough for LLM training too?
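For a sense of scale on the "de-duplicate, clean" part: even a simple exact-match dedup pass over a Pile-sized corpus is mostly a streaming/hashing problem. A minimal sketch (stdlib only; the whitespace-collapsing normalization is just one illustrative choice, and real pipelines typically add fuzzy dedup like MinHash on top):

```python
import hashlib

def normalize(text: str) -> str:
    # Collapse whitespace and lowercase so trivially different copies hash the same.
    return " ".join(text.lower().split())

def dedup(docs):
    """Yield each document whose normalized text has not been seen before.

    Only the 20-byte SHA-1 digests are kept in memory, not the documents,
    which is what makes this workable on consumer-grade hardware.
    """
    seen = set()
    for doc in docs:
        digest = hashlib.sha1(normalize(doc).encode("utf-8")).digest()
        if digest not in seen:
            seen.add(digest)
            yield doc

docs = ["Hello  World", "hello world", "something else"]
print(list(dedup(docs)))  # the second, near-identical copy is dropped
```

In practice you would stream this over the Pile's compressed JSONL shards rather than an in-memory list, but the structure is the same.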
> 2. get on a call with the sales teams of major cloud providers to procure a few thousands GPUs and enter into too long contracts.
It seems like the standard InstructGPT model itself is based on a 1-billion-parameter GPT model. Wouldn't that fit on a 24 GB RTX 3090? Training might take longer, and there may not be much room for hyperparameter search, but it still seems possible, right? Or is hyperparameter search on a thousand machines in parallel the real magic sauce here?
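A back-of-the-envelope check on the 3090 question. Assuming a typical mixed-precision AdamW setup (fp16/bf16 weights and grads, fp32 master weights, fp32 Adam moments), the per-parameter byte counts below are rough conventions, not measurements, and activations are ignored entirely:

```python
def adamw_train_bytes(n_params: int,
                      weight_bytes: int = 2,   # fp16/bf16 weights
                      grad_bytes: int = 2,     # fp16/bf16 gradients
                      master_bytes: int = 4,   # fp32 master copy of weights
                      optim_bytes: int = 8) -> int:
    """Rough memory for model + optimizer state in mixed-precision AdamW.

    Ignores activations, which scale with batch size and sequence length
    and can dominate without gradient checkpointing.
    """
    return n_params * (weight_bytes + grad_bytes + master_bytes + optim_bytes)

gib = adamw_train_bytes(1_300_000_000) / 2**30
print(f"{gib:.1f} GiB")  # ~19.4 GiB for a 1.3B model, before activations
```

So a ~1.3B model plausibly squeezes onto a 24 GB card only with tricks (gradient checkpointing, 8-bit optimizers, small micro-batches); it's tight rather than comfortable.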
> 3. "pretrain" a GPT. one common way to do this atm is to create your own exotic fork of MegatronLM+DeepSpeed. go through training hell, learn all about every possible NCCL error message, see the OPT logbook as good reference: https://github.com/facebookresearch/metaseq/blob/main/projec...
Sounds like a good opportunity to learn. No pain, no gain :-)
> 4. follow the 3-step recipe of https://openai.com/blog/chatgpt/ to finetune the model to be an actual assistant instead of just "document completor", which otherwise happily e.g. responds to questions with more questions. Also e.g. see OPT-IML https://arxiv.org/abs/2212.12017 , or BLOOMZ https://arxiv.org/abs/2211.01786 to get a sense of the work involved here.
Maybe somebody will open-source the equivalent datasets for this soon? Otherwise the data collection seems prohibitively expensive for somebody trying to do this for fun: contract expert annotators, train them, and annotate/re-annotate for months?
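For what the annotators' output is ultimately used for: step 2 of the recipe trains a reward model on human preference pairs with a pairwise log-sigmoid loss over the two candidates' scores. A toy sketch (stdlib only; the scores here are made up, not from any real model):

```python
import math

def pairwise_loss(r_chosen: float, r_rejected: float) -> float:
    """Pairwise reward-model loss: -log sigmoid(r_chosen - r_rejected).

    Small when the reward model already scores the human-preferred answer
    higher; large when it prefers the rejected one.
    """
    margin = r_chosen - r_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Hypothetical scores for one annotated comparison.
agree = pairwise_loss(2.0, 0.5)      # model agrees with the annotator: low loss
disagree = pairwise_loss(0.5, 2.0)   # model disagrees: much higher loss
print(agree < disagree)
```

Each loss term needs one human judgment, which is why the dataset cost scales directly with annotator hours.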