MakeAJiraTicket | 4 months ago

Defensive programming is considered "correct" by the people doing the reinforcing, and it is a huge part of the corpus LLMs are trained on. For example, most Python code doesn't do manual index management, so when a model sees manual index management it is much more likely to freak out and hallucinate a bug. It will also promote "silent failure" patterns even when a silent failure results in things like infinite loops, because it was trained on a lot of tutorial Python code, and "industry standard" style gets more reinforcement during training.
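A toy sketch of the failure mode described above: manual index management rewritten with a "defensive" exception swallower, where the silent failure becomes an infinite loop. The function names and data are made up for illustration.

```python
def drain(items):
    # Manual index management: the index is advanced explicitly.
    i = 0
    out = []
    while i < len(items):
        out.append(items[i] * 2)
        i += 1
    return out

def drain_defensive(items):
    # A "defensive" rewrite that silently swallows errors. If the body
    # raises before i += 1 runs, the index never advances and the loop
    # spins forever -- the silent failure turns into an infinite loop.
    i = 0
    out = []
    while i < len(items):
        try:
            out.append(items[i] * 2)  # raises TypeError on e.g. None
            i += 1
        except TypeError:
            continue  # swallowed: i is never incremented on this path
    return out
```

`drain_defensive([1, None, 3])` never returns, while the original `drain` would have raised loudly at the bad element.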

These aren't operating on reward functions because there's no internal model to reward. It's word prediction; there's no intelligence.

LeifCarrotson | 4 months ago

LLMs do use simple "word prediction" in the pretraining step, just ingesting huge quantities of existing data. But that's not what LLM companies are shipping to end users.

Subsequently, ChatGPT/Claude/Gemini/etc. go through additional training: supervised fine-tuning, then reinforcement learning with reward functions, whether derived from human feedback (RLHF) or from programmatically verified rewards (RLVR).
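A minimal sketch of what a "verified reward" can look like in the RLVR setting: a programmatic check that scores a completion with no human labeler in the loop. The prompt format and function name are hypothetical; real pipelines use much richer verifiers.

```python
def reward(prompt: str, completion: str) -> float:
    # Verifiable reward for a toy arithmetic domain: 1.0 if the model's
    # answer to a prompt like "2+3=" is correct, else 0.0.
    # eval() is acceptable only because this toy controls the prompts.
    expected = str(eval(prompt.rstrip("=")))
    return 1.0 if completion.strip() == expected else 0.0

# In real RL fine-tuning, this scalar feeds a policy-gradient update
# (e.g. PPO-style) that shifts the model toward high-reward completions.
samples = [("2+3=", "5"), ("2+3=", "6")]
scores = [reward(p, c) for p, c in samples]
```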

Whether that fine-tuning and those reward functions give them real "intelligence" is open to interpretation, but it's not 100% plagiarism.

aoeusnth1 | 4 months ago

You used the word reinforcing, and then asserted there's no reward function. Can you explain how it's possible to perform RL without a reward function, and how the LLM training process maps to that?

MakeAJiraTicket | 4 months ago

LLM actions are divorced from that reward function; it's not something they consult or consider. A "reward function" in that context doesn't make sense.

comex | 4 months ago

Reinforcement learning by definition operates on reward functions.
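To make that concrete, here is a minimal REINFORCE loop on a toy two-armed bandit, where every parameter update is driven directly by the reward function. This is a toy sketch, not an LLM training loop; all names are illustrative.

```python
import math
import random

def softmax(logits):
    # Numerically stable softmax over a list of logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def reward(action):
    # The reward function the policy is trained against: arm 1 pays off.
    return 1.0 if action == 1 else 0.0

def train(steps=2000, lr=0.1, seed=0):
    rng = random.Random(seed)
    logits = [0.0, 0.0]  # one logit per arm
    for _ in range(steps):
        probs = softmax(logits)
        a = 0 if rng.random() < probs[0] else 1  # sample an action
        adv = reward(a) - 0.5  # fixed baseline to reduce variance
        # REINFORCE update: grad of log pi(a) under a softmax policy
        # is (indicator(i == a) - probs[i]) for each logit i.
        for i in range(2):
            logits[i] += lr * adv * ((1.0 if i == a else 0.0) - probs[i])
    return softmax(logits)
```

After training, the policy puts nearly all its probability on the rewarded arm; without the reward signal there is nothing to update against, which is the point of the comment above.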