top | item 44657754

(no title)

rtrgrd | 7 months ago

I thought human preferences was typically considered a noisy reward signal

discuss

If it was just "noisy", you could compensate with scale. It's worse than that.

"Human preference" is incredibly fucking entangled, and we have no way to disentangle it and get rid of all the unwanted confounders. A lot of the recent "extreme LLM sycophancy" cases is downstream from that.