The goal is to iteratively generate new training data and add it to the model's own training set. The LLM acts as its own judge, scoring its own responses to decide which ones are worth keeping. Since keeping a human in the loop to label preferences is expensive, the folks at Meta showed you can instead use a carefully designed judging prompt, and fine-tune the model on its own judgments, so it learns to score its own responses.
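A minimal sketch of one iteration of that loop, with toy stand-in functions (`generate_response` and `judge_score` are placeholders here, not the actual model or the paper's rubric): sample several candidate responses per prompt, score each with the model-as-judge, and keep the best/worst pair as preference data for the next round of fine-tuning.

```python
# Hypothetical stand-ins: in the real setup, both of these would be calls to
# the same fine-tuned LLM, generating and then judging its own output.
def generate_response(prompt: str, seed: int) -> str:
    return f"response-{seed} to: {prompt}"

def judge_score(prompt: str, response: str) -> int:
    # Placeholder for an LLM-as-a-Judge prompt that grades an answer on a
    # small integer scale; here just a toy deterministic score.
    return sum(ord(c) for c in prompt + response) % 6

def build_preference_pair(prompt: str, n_samples: int = 4):
    """Sample N candidates, score each with the judge, and keep the
    highest- and lowest-scored as a chosen/rejected preference pair."""
    candidates = [generate_response(prompt, s) for s in range(n_samples)]
    ranked = sorted(candidates, key=lambda r: judge_score(prompt, r))
    worst, best = ranked[0], ranked[-1]
    if judge_score(prompt, best) == judge_score(prompt, worst):
        return None  # all candidates tied: no preference signal
    return {"prompt": prompt, "chosen": best, "rejected": worst}

def self_rewarding_iteration(prompts):
    """One iteration: build preference pairs, which would then be used to
    fine-tune the model (e.g. with DPO) before the next iteration."""
    pairs = (build_preference_pair(p) for p in prompts)
    return [p for p in pairs if p is not None]
```

The self-scoring step is what removes the human labeler: the preference labels come from the model's own judgments rather than from annotators.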