noch | 1 year ago
Karpathy wrote[^0]:
"
RL is powerful. RLHF is not.
[…]
And yet, RLHF is a net helpful step of building an LLM Assistant. I think there's a few subtle reasons but my favorite one to point to is that through it, the LLM Assistant benefits from the generator-discriminator gap. That is, for many problem types, it is a significantly easier task for a human labeler to select the best of few candidate answers, instead of writing the ideal answer from scratch.
[…]
No production-grade actual RL on an LLM has so far been convincingly achieved and demonstrated in an open domain, at scale.
"
---
parodysbird | 1 year ago