top | item 42829146

(no title)

piecerough | 1 year ago

SFT forces the model to output _that_ reasoning trace you have in data. RL allows whatever reasoning trace and only penalizes it if it does not reach the same answer

discuss

order

No comments yet.