item 43981352


nielstron | 9 months ago

we were thinking about doing exactly this; the closest current work is probably the amazing "Learning Formal Mathematics from Intrinsic Motivation" by Poesia et al. (they use constraints to increase the likelihood of generating correct theorems/proofs during RL)

https://arxiv.org/abs/2407.00695


informal007|9 months ago

Yes, RL works well in fields where answers can be verified to varying degrees. That's why AlphaGo succeeded, and it should also work in code generation and math.

imtringued|9 months ago

Your reward function can simply be the distance between the constrained output and the unconstrained output; that way you won't even need synthetic data, just a dataset of prompts to RL against.
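A minimal sketch of what such a reward might look like, assuming string-level edit distance as the distance measure (the comment doesn't specify one; a token-level divergence between the constrained and unconstrained policies would be another reasonable choice):

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # deletion
                cur[j - 1] + 1,            # insertion
                prev[j - 1] + (ca != cb),  # substitution (free if chars match)
            ))
        prev = cur
    return prev[-1]

def reward(constrained_output: str, unconstrained_output: str) -> float:
    # Higher (less negative) reward when constrained decoding stays
    # close to the model's free generation for the same prompt.
    return -float(levenshtein(constrained_output, unconstrained_output))
```

The RL loop would then sample both a constrained and an unconstrained completion for each prompt in the dataset and optimize the policy against this reward, so no labeled targets are needed.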