top | item 43764170

(no title)

iceman_w | 10 months ago

RL narrows the distribution over output token sequences toward those that are likely to lead to the correct answer. So we are inherently making a trade-off: lower variance in exchange for less coverage. A non-RL model has higher variance, so given enough attempts, it will come up with some correct answers that an RL model will effectively never sample.
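To make the trade-off concrete, here is a rough sketch using a pass@k-style calculation. The per-problem success probabilities are made up for illustration (they are not measurements of any real model), and `pass_at_k`, `base_model`, and `rl_model` are hypothetical names. It assumes samples are independent.

```python
def pass_at_k(success_probs, k):
    """Average chance that at least one of k independent samples is correct."""
    return sum(1 - (1 - p) ** k for p in success_probs) / len(success_probs)

# Hypothetical per-problem success probabilities (illustrative only).
# Base model: high variance, a small chance of solving every problem.
base_model = [0.05] * 10
# RL model: very reliable on 6 problems, zero probability mass on the other 4.
rl_model = [0.9] * 6 + [0.0] * 4

for k in (1, 10, 100):
    base = pass_at_k(base_model, k)
    rl = pass_at_k(rl_model, k)
    print(f"k={k:>3}  base={base:.3f}  rl={rl:.3f}")
```

Under these assumed numbers the RL model is far ahead at k=1, but it is capped at 0.6 no matter how many samples you draw, while the base model keeps climbing and overtakes it once k is large enough.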
