[+] [-] starbugs|2 years ago|reply
Sorry if this is a dumb question, but how does that ensure the training process doesn't go in the wrong direction because of error accumulation?
Maybe I didn't understand something fundamental here. (Not an LLM expert.)
[+] [-] huac|2 years ago|reply
I don't think it does. And there is a pretty big risk that you end up picking up on some quirk ("bias") of your reward model that doesn't reflect reality -- GPT-4 preferring longer answers is one commonly observed bias. AFAIK there is no great theoretical basis for why we can avoid mode collapse; it's just that, empirically, the models are good enough to survive some bootstrapping.
[+] [-] candiodari|2 years ago|reply
I would like to add that there are plenty of examples, some in math (e.g. geometry) playing out over >1000 years and dozens of generations, of the same thing happening in humans.
That said, for both humans and this kind of LLM, it does appear to improve performance, certainly in the near term.
[+] [-] unknown|2 years ago|reply
[deleted]
[+] [-] potatoman22|2 years ago|reply
"Fine-tuning Llama 2 70B on three iterations of our approach yields a model that outperforms many existing systems on the AlpacaEval 2.0 leaderboard, including Claude 2, Gemini Pro, and GPT-4 0613."
Cool and impressive. I'm curious if this training method will become more common.
[+] [-] lhl|2 years ago|reply
* Announcement: https://twitter.com/billyuchenlin/status/1749975138307825933
* Model Card: https://huggingface.co/snorkelai/Snorkel-Mistral-PairRM-DPO
* Response Re-Ranker: https://huggingface.co/llm-blender/PairRM
"We would also like to acknowledge contemporary work published independently on arXiv on 2024-01-18 by Meta & NYU (Yuan, et al) in a paper called Self-Rewarding Language Models, which proposes a similar general approach for creating alignment pairs from a larger set of candidate responses, but using the LLM as the reward model. While this may work for general-purpose models, our experience has shown that task-specific reward models guided by SMEs are necessary for most enterprise applications of LLMs for specific use cases, which is why we focus on the use of external reward models."
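The pipeline lhl quotes here (generate N candidate responses, re-rank them with an external pairwise reward model such as PairRM, keep the best and worst as an alignment pair) can be sketched roughly. This is a minimal illustration, not the actual Snorkel or PairRM implementation; `pairwise_prefers` is a hypothetical stand-in for the real preference model:

```python
from itertools import combinations

def rank_candidates(candidates, pairwise_prefers):
    """Round-robin tournament: each candidate's score is the number of
    pairwise comparisons it wins under the preference model."""
    wins = {i: 0 for i in range(len(candidates))}
    for i, j in combinations(range(len(candidates)), 2):
        if pairwise_prefers(candidates[i], candidates[j]):
            wins[i] += 1
        else:
            wins[j] += 1
    order = sorted(wins, key=wins.get, reverse=True)
    return [candidates[k] for k in order]

def to_dpo_pair(candidates, pairwise_prefers):
    """Keep the top- and bottom-ranked responses as (chosen, rejected)
    for downstream preference fine-tuning."""
    ranked = rank_candidates(candidates, pairwise_prefers)
    return ranked[0], ranked[-1]
```

With N candidates this costs N·(N−1)/2 comparisons per prompt, which is why pairwise rankers are typically run over a small candidate set.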
[+] [-] greyface-|2 years ago|reply
[+] [-] dang|2 years ago|reply
Self-Rewarding Language Models - https://news.ycombinator.com/item?id=39051279 - Jan 2024 (58 comments)
[+] [-] frogamel|2 years ago|reply
1. Train the model as normal
2. Have the model evaluate its own candidate outputs
3. Use those evaluations as preference pairs for DPO fine-tuning
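The loop summarized above can be written down concretely. A minimal sketch in plain Python (not the paper's implementation; `score` is a hypothetical stand-in for the model-as-judge or an external reward model, and the loss takes pre-computed sequence log-probabilities rather than running a real policy):

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Direct Preference Optimization loss for one preference pair.
    Inputs are summed log-probabilities of the full responses under the
    current policy and the frozen reference policy."""
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(margin)): small when the policy already prefers
    # the chosen response more strongly than the reference does
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

def build_pair(candidates, score):
    """Steps 2-3: score the model's own candidate responses and take
    the best/worst as the (chosen, rejected) pair for DPO."""
    ranked = sorted(candidates, key=score, reverse=True)
    return ranked[0], ranked[-1]
```

At a margin of zero the loss is log 2 (≈0.693), and it decreases as the policy's preference for the chosen response grows relative to the reference, which is the signal each DPO iteration pushes on.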
[+] [-] lucidrains|2 years ago|reply
The aim is really to give a good base for follow-up research and modifications, of which I think there will be many for this paper.
[+] [-] lucidrains|2 years ago|reply
[+] [-] dannyw|2 years ago|reply
[+] [-] choppaface|2 years ago|reply
[+] [-] greatpostman|2 years ago|reply
[+] [-] code51|2 years ago|reply
What's the evidence here that this is not just a kind of leaderboard hacking for LLMs?
[+] [-] nmitchko|2 years ago|reply
Only question: why do you name variables with the λ symbol?
[+] [-] lucidrains|2 years ago|reply
[+] [-] unknown|2 years ago|reply
[deleted]