top | item 42829146 (no title) piecerough | 1 year ago SFT forces the model to output _that_ reasoning trace you have in data. RL allows whatever reasoning trace and only penalizes it if it does not reach the same answer discuss order hn newest No comments yet.
No comments yet.