top | item 44656820

Melonololoti | 7 months ago

Yep, it comes down to continuing to gather more and better data.

AI is not just hype. We have started to actually do something with all that data, and this process will not stop anytime soon.

The RL now happening through human feedback alone (thumbs up/down) is already massive.
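As a rough sketch of what "RL through thumbs up/down" means mechanically (this is not any lab's actual pipeline, and every feature, model, and number below is a toy assumption): binary feedback can be used to fit a scalar reward model, whose score can then drive RL fine-tuning.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: each response is a feature vector; users give
# thumbs up (1) / thumbs down (0). We fit a logistic "reward model"
# whose scalar score could later be used as an RL reward signal.
n, d = 5000, 8
X = rng.normal(size=(n, d))        # response features (stand-in)
w_true = rng.normal(size=d)        # latent "true" human preference
p = 1 / (1 + np.exp(-X @ w_true))
y = rng.random(n) < p              # observed thumbs up / thumbs down

# Plain logistic regression by gradient descent = a minimal reward model.
w = np.zeros(d)
lr = 0.1
for _ in range(500):
    grad = X.T @ (1 / (1 + np.exp(-X @ w)) - y) / n
    w -= lr * grad

reward = X @ w                     # scalar reward per response
```

With enough feedback, the fitted weights line up closely with the latent preference direction, which is the whole point: cheap binary clicks aggregate into a usable reward signal.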

KaiserPro | 7 months ago

It was always the case. We only managed to make a decent model once we created a decent dataset.

This meant building a rich synthetic dataset first to pre-train the model, then fine-tuning on real, expensive data to get the best results.

But this was always the case.
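A toy illustration of that two-stage recipe (pre-train on plentiful synthetic data, then fine-tune on scarce real data); linear regression stands in for the model, and every dataset, shape, and number here is invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def sgd(w, X, y, lr, steps):
    # Plain full-batch gradient descent on squared error.
    for _ in range(steps):
        w = w - lr * X.T @ (X @ w - y) / len(y)
    return w

d = 16
w_real = rng.normal(size=d)                # the "real" task
w_syn = w_real + 0.3 * rng.normal(size=d)  # synthetic task: close, not exact

X_syn = rng.normal(size=(10_000, d)); y_syn = X_syn @ w_syn   # cheap, plentiful
X_real = rng.normal(size=(50, d));    y_real = X_real @ w_real  # expensive, scarce

pretrained = sgd(np.zeros(d), X_syn, y_syn, lr=0.1, steps=200)
finetuned  = sgd(pretrained, X_real, y_real, lr=0.05, steps=100)
scratch    = sgd(np.zeros(d), X_real, y_real, lr=0.05, steps=100)

X_test = rng.normal(size=(1000, d)); y_test = X_test @ w_real
err = lambda w: float(np.mean((X_test @ w - y_test) ** 2))
```

Under the same small fine-tuning budget, starting from the synthetic pre-train lands much closer to the real task than training from scratch on the scarce real data alone.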

noname120 | 7 months ago

RLHF wasn't needed for DeepSeek; gobbling up the whole internet (both the good and the bad stuff) was enough. See their paper.

rtrgrd | 7 months ago

I thought human preference was typically considered a noisy reward signal.

ACCount36 | 7 months ago

If it was just "noisy", you could compensate with scale. It's worse than that.

"Human preference" is incredibly fucking entangled, and we have no way to disentangle it and get rid of all the unwanted confounders. A lot of the recent "extreme LLM sycophancy" cases are downstream of that.
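The distinction can be seen in a toy simulation: pure label noise washes out as data grows, but a systematic confounder (here, an invented rater bias toward longer answers) survives any amount of data. All variables and magnitudes below are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_reward(n, noise=0.0, length_bias=0.0):
    # Raters judge a hidden "quality"; optionally with random noise,
    # optionally with a systematic preference for longer answers.
    quality = rng.normal(size=n)
    length = rng.normal(size=n)                 # independent of quality
    score = quality + length_bias * length
    labels = score + noise * rng.normal(size=n) > 0   # thumbs up/down
    # Least-squares fit: reward = a * quality + b * length
    X = np.stack([quality, length], axis=1)
    coef, *_ = np.linalg.lstsq(X, labels.astype(float), rcond=None)
    return coef  # [weight on quality, weight on length]

noisy = fit_reward(200_000, noise=2.0)             # noisy but unbiased
confounded = fit_reward(200_000, length_bias=1.0)  # entangled preference
```

With 200k samples the noisy fit puts essentially all its weight on quality (the noise averaged out), while the confounded fit rewards length about as much as quality, and no amount of extra data would fix that.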