mattalex | 3 years ago
This can be done on a comparatively small scale, since you don't need to train on trillions of words, only fine-tune on a much smaller set of high-quality demonstrations (even OpenAI didn't have a lot of that).
In fact, if you look at Figure 1 of the original paper https://arxiv.org/pdf/2203.02155.pdf, you can see that even small instruction-tuned models already significantly beat the then-current SOTA.
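To make the "small high-quality data" point concrete, here is a minimal sketch of what supervised fine-tuning demonstrations look like: human-written prompt/completion pairs serialized as JSONL, one example per line. The two demonstrations below are made up for illustration; the real InstructGPT dataset was written by contracted labelers.

```python
import json

# Hypothetical human-written demonstrations: prompt/completion pairs,
# the raw material for the supervised fine-tuning stage in InstructGPT.
demonstrations = [
    {"prompt": "Explain photosynthesis to a child.",
     "completion": "Plants use sunlight to turn air and water into food."},
    {"prompt": "Translate 'good morning' to French.",
     "completion": "Bonjour."},
]

def to_jsonl(examples):
    """Serialize demonstrations to JSONL, one training example per line."""
    return "\n".join(json.dumps(ex, ensure_ascii=False) for ex in examples)

jsonl = to_jsonl(demonstrations)
print(len(jsonl.splitlines()))  # → 2, one line per demonstration
```

The point of the comment is that collecting tens of thousands of such pairs is a human-coordination problem, not a compute problem, which is exactly where open-source projects tend to be strong.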
Open source projects often have trouble securing hardware resources, but the "social" resources for producing a large dataset are much easier to manage in OSS projects. In fact, the data an OSS project collects might just be better, since it doesn't have to rely on paying a handful of minimum-wage workers to produce thousands of examples.
In fact, one of the main objectives is to reduce the bias introduced by OpenAI's screening and selection process, which is doable since many more people can work on generating the data.