(no title)
piecerough | 1 year ago
Later on, the community did SFT on such chains of thought. Arguably, R1 shows that this was a distraction, and that a clean RL reward would have been better suited.