bogtog | 3 months ago

The premise of this post and the one cited near the start (https://www.tobyord.com/writing/inefficiency-of-reinforcemen...) is that RL involves just 1 bit of learning for a rollout, rewarding success/failure.

However, the way I'm seeing this is that an RL rollout may involve, say, 100 small decisions out of a pool of 1,000 possible decisions. Each training step will slightly upregulate/downregulate a given decision in that decision's context. There will be uncertainty about which decisions were helpful/harmful -- we only have 1 bit of information after all -- but this setup, where many decisions are slowly learned across many examples, seems like it would lend itself well to generalization (e.g., instead of 1 bit in one context, you get a hundred 0.01-bit insights across 100 contexts). There may be some benefits not captured by comparing the number of bits relative to pretraining.

As the blog says, "Fewer bits, sure, but very valuable bits." That seems like a separate factor that is also true: learning these small decisions may be vastly more valuable for producing accurate outputs than learning through pretraining.
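The "hundred 0.01-bit insights" intuition can be sketched as a toy simulation (all numbers here are invented for illustration, not from the post): a pool of decisions, a binary reward for the whole rollout, and a REINFORCE-style update that smears that single bit across every decision the rollout used.

```python
import random

# Toy sketch of the parent's intuition: 1,000 possible "decisions",
# 100 of which are secretly helpful. Each rollout uses 100 of them and
# gets back a single success/failure bit. The update nudges every
# participating decision's score by lr * (reward - baseline), so one
# bit is spread thinly over 100 decisions -- but across many rollouts
# the helpful decisions drift upward anyway.

random.seed(0)

POOL = 1000
GOOD = set(range(100))      # hidden helpful decisions
scores = [0.0] * POOL       # learned per-decision scores
BASELINE = 0.5              # average reward, so updates are zero-mean
LR = 0.1

for _ in range(5000):
    picks = random.sample(range(POOL), 100)
    n_good = sum(1 for p in picks if p in GOOD)
    # success is more likely the more helpful decisions the rollout used
    reward = 1.0 if random.random() < n_good / 20 else 0.0
    for p in picks:         # credit spread equally over all 100 picks
        scores[p] += LR * (reward - BASELINE)

good_avg = sum(scores[i] for i in GOOD) / len(GOOD)
bad_avg = sum(scores[i] for i in range(POOL) if i not in GOOD) / (POOL - len(GOOD))
print(good_avg > bad_avg)  # helpful decisions end up with higher scores
```

Despite each individual update being almost pure noise, the helpful decisions appear in disproportionately many successful rollouts, so their scores separate from the rest over enough examples.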


refulgentis|3 months ago

Dwarkesh's blogging confuses me, because I am not sure if he is free-associating or relaying information he has gathered.

ex. how this reads if it is free-associating: "shower thought: RL on LLMs is kinda just 'did it work or not?' and the answer is just 'yes or no', yes or no is a boolean, a boolean is 1 bit, then bring in the information-theory interpretation of that, therefore RL doesn't give nearly as much info as, like, a bunch of words in pretraining"

or

ex. how this reads if it is relaying information gathered: "A common problem across people at companies who speak honestly with me about the engineering side off the air is figuring out how to get more out of RL. The biggest wall currently is the cross product of RL training being slowww and lack of GPUs. More than one of them has shared with me that if you can crack the part where the model gets very little info out of one run, then the GPU problem goes away. You can't GPU your way out of how little info they get"

I am continuing to assume it is much more the former than the latter, given your thorough-sounding explanation and my prior that he's not shooting the shit about specific technical problems off-air with multiple grunts.

bugglebeetle|3 months ago

Dwarkesh has a CS degree, but zero academic training or real world experience in deep learning, so all of his blogging is just secondhand bullshitting to further siphon off a veneer of expertise from his podcast guests.

ACCount37|3 months ago

RL is very important - because while it's inefficient, and sucks at creating entirely new behaviors or features in LLMs, it excels at bringing existing features together and tuning them to perform well.

It's a bit like LLM glue. The glue isn't the main material - but it's the one that holds it all together.

DevelopingElk|3 months ago

RL before LLMs could very much learn new behaviors; take a look at AlphaGo for that. It can also learn to drive in simulated environments. RL in LLMs is not learning in the same way, so it can't create its own behaviors.

macleginn|3 months ago

It is the same type of learning, fundamentally: increasing/decreasing token probabilities based on the left context. RL simply provides more training data from online sampling.
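That "same type of learning" claim can be made concrete with a small sketch (illustrative only, not any lab's actual training code): the REINFORCE-style objective on a sampled token sequence is just the pretraining cross-entropy scaled by a scalar advantage.

```python
import math

# Both updates push token probabilities up or down given the left
# context; RL only changes where the sequences come from (online
# sampling) and how heavily each one is weighted (the reward).

def pretrain_loss(token_logprobs):
    # next-token prediction: maximize sum of log p(token | left context)
    return -sum(token_logprobs)

def reinforce_loss(token_logprobs, reward, baseline=0.0):
    # the same sum of log-probs, weighted by the advantage (reward - baseline)
    return -(reward - baseline) * sum(token_logprobs)

lps = [math.log(0.5), math.log(0.25), math.log(0.5)]
# successful rollout (advantage +1): identical to the pretraining loss
print(reinforce_loss(lps, reward=1.0) == pretrain_loss(lps))
# failed rollout (advantage -0.5): the weight flips sign, so minimizing
# the loss pushes those tokens' probabilities down instead of up
print(reinforce_loss(lps, reward=0.0, baseline=0.5) < 0)
```

So the gradient machinery is the pretraining machinery; the disagreement upthread is really about whether reward-weighted reuse of existing token habits can ever amount to a genuinely new behavior.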