top | item 46114331 (no title) swordsmith | 3 months ago Seems like he thinks RLVR == learning from binary reward for the whole chain, completely discounting techniques to provide denser rewards like process reward supervision? discuss order hn newest No comments yet.
No comments yet.