gregschoeninger|1 year ago
After reading the Self-Rewarding Language Models paper by the team at Meta, it felt very approachable and reproducible, so we spent some time implementing it.
The scripts provided take any base model and put it in a loop of:
1) Supervised fine-tuning on an initial dataset
2) Generating new prompts with the SFT'd model
3) Generating N responses per prompt
4) Scoring each generated response from 1 to 5, with the model acting as its own judge
5) Running DPO on preference pairs built from those self-assigned rewards
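For anyone curious how the data flows between steps 3-5, here's a minimal sketch of how we turn self-scored generations into DPO preference pairs. All the model calls are stubbed placeholders (the names `generate_prompts`, `generate_responses`, and `score_response` are just illustrative, not from the paper or our repo); in the real loop these are LLM calls and the scorer uses an LLM-as-a-Judge prompt:

```python
import random

# Toy sketch of the self-rewarding data loop. Model calls are stubbed out;
# in practice each of these is an inference call against the SFT'd model.

def generate_prompts(model, n):
    # Hypothetical stub: the SFT'd model would generate new instruction prompts.
    return [f"prompt-{i}" for i in range(n)]

def generate_responses(model, prompt, n):
    # Hypothetical stub: sample N candidate responses per prompt.
    return [f"{prompt}/response-{i}" for i in range(n)]

def score_response(model, prompt, response):
    # Hypothetical stub: the paper scores 1-5 via an LLM-as-a-Judge prompt;
    # here we just fake it with a random score.
    return random.randint(1, 5)

def build_preference_pairs(model, prompts, n_responses=4):
    """Turn self-scored generations into (chosen, rejected) pairs for DPO."""
    pairs = []
    for prompt in prompts:
        responses = generate_responses(model, prompt, n_responses)
        # Rank candidates by the model's own score; lowest first.
        ranked = sorted(responses, key=lambda r: score_response(model, prompt, r))
        rejected, chosen = ranked[0], ranked[-1]
        pairs.append({"prompt": prompt, "chosen": chosen, "rejected": rejected})
    return pairs

model = object()  # placeholder for the SFT'd model
pairs = build_preference_pairs(model, generate_prompts(model, 3))
# Each dict then feeds a DPO trainer: chosen = highest-scored response,
# rejected = lowest-scored response for the same prompt.
```

The real implementation has more bookkeeping (deduping, filtering ties where chosen and rejected get the same score), but the core idea is just "rank your own samples, train on the extremes."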
We've run it through one loop starting from a Mistral-7B base model and the results are pretty encouraging so far.
Feel free to check it out or run it for yourself and let us know what you think:
https://github.com/Oxen-AI/Self-Rewarding-Language-Models