top | item 46114331


swordsmith | 3 months ago

Seems like he thinks RLVR == learning from a single binary reward for the whole chain, completely discounting techniques that provide denser rewards, like process reward supervision?
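To make the distinction concrete, here is a rough toy sketch, not anyone's actual training setup. The names `outcome_rewards`, `process_rewards`, and `toy_scorer` are made up for illustration, and the keyword-based scorer is only a stand-in for a learned process reward model.

```python
from typing import Callable, List


def outcome_rewards(steps: List[str], is_correct: bool) -> List[float]:
    """Sparse signal: zeros on every step, one binary reward on the last step."""
    if not steps:
        return []
    final = 1.0 if is_correct else 0.0
    return [0.0] * (len(steps) - 1) + [final]


def process_rewards(steps: List[str], score_step: Callable[[str], float]) -> List[float]:
    """Denser signal: each intermediate step gets its own score."""
    return [score_step(step) for step in steps]


def toy_scorer(step: str) -> float:
    # Hypothetical stand-in for a learned process reward model:
    # penalize any step that is flagged with the word "error".
    return 0.0 if "error" in step else 1.0


chain = ["expand the product", "error: dropped a sign", "simplify"]
sparse = outcome_rewards(chain, is_correct=False)
dense = process_rewards(chain, toy_scorer)
print(sparse)
print(dense)
```

With the outcome-only reward, every step of a failed chain looks the same to the learner, while the per-step version singles out the step that went wrong.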
