(no title)
czk
|
10 months ago
Will be interesting to see how they tighten the reward signal / ground outputs in some verifiable context: don't reward it for sounding right (RLHF), reward it for being right. But you'd probably need some sort of system to backprop a fact-checked score, and I imagine that would slow down training quite a bit. If the verifier finds a false claim, it should reward the model for saying "I don't know" instead.
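One way to picture that reward shaping, purely as a sketch: assume some `fact_check` oracle that returns true/false/unverifiable per claim. Everything here (the function names, the reward constants, the abstention bonus) is made up for illustration, not any real RLHF pipeline.

```python
from typing import Callable, List, Optional

def verified_reward(
    claims: List[str],
    fact_check: Callable[[str], Optional[bool]],
    abstained: bool,
) -> float:
    """Reward verifiably correct claims, penalize false ones, and give a
    small positive reward for abstaining ("I don't know") over fabricating."""
    if abstained:
        return 0.2  # honest uncertainty beats a confident false claim
    score = 0.0
    for claim in claims:
        verdict = fact_check(claim)
        if verdict is True:
            score += 1.0   # verifiably correct
        elif verdict is False:
            score -= 2.0   # false claims cost more than true ones earn
        # verdict is None: unverifiable, contributes nothing
    return score / max(len(claims), 1)

# toy oracle for illustration
facts = {
    "water boils at 100C at sea level": True,
    "the moon is made of cheese": False,
}
oracle = lambda c: facts.get(c)

print(verified_reward(["water boils at 100C at sea level"], oracle, abstained=False))  # 1.0
print(verified_reward(["the moon is made of cheese"], oracle, abstained=False))        # -2.0
print(verified_reward([], oracle, abstained=True))                                     # 0.2
```

With the false-claim penalty larger than the correct-claim reward, the expected-value-maximizing move when uncertain is to abstain, which is exactly the "reward it for saying I don't know" behavior.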