(no title)
czk
|
10 months ago
Will be interesting to see how they tighten the reward signal / ground outputs in some verifiable context: don't reward it for sounding right (RLHF), reward it for being right. But you'd probably need some sort of system to backprop a fact-checked score, and I imagine that would slow down training quite a bit. If the verifier finds a false claim, it should reward the model for saying "I don't know" instead.
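One way to picture that reward shaping, purely as a sketch: assume some `fact_check` oracle that returns true/false/unverifiable per claim. Everything here (the function names, the reward constants, the abstention bonus) is made up for illustration, not any real RLHF pipeline.

```python
from typing import Callable, List, Optional

def verified_reward(
    claims: List[str],
    fact_check: Callable[[str], Optional[bool]],
    abstained: bool,
) -> float:
    """Reward verifiably correct claims, penalize false ones, and give a
    small positive reward for abstaining ("I don't know") over fabricating."""
    if abstained:
        return 0.2  # honest uncertainty beats a confident false claim
    score = 0.0
    for claim in claims:
        verdict = fact_check(claim)
        if verdict is True:
            score += 1.0   # verifiably correct
        elif verdict is False:
            score -= 2.0   # false claims cost more than true ones earn
        # verdict is None: unverifiable, contributes nothing
    return score / max(len(claims), 1)

# toy oracle for illustration
facts = {
    "water boils at 100C at sea level": True,
    "the moon is made of cheese": False,
}
oracle = lambda c: facts.get(c)

print(verified_reward(["water boils at 100C at sea level"], oracle, abstained=False))  # 1.0
print(verified_reward(["the moon is made of cheese"], oracle, abstained=False))        # -2.0
print(verified_reward([], oracle, abstained=True))                                     # 0.2
```

With the false-claim penalty larger than the correct-claim reward, the expected-value-maximizing move when uncertain is to abstain, which is exactly the "reward it for saying I don't know" behavior.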