The interesting delta here is that this proves we can distribute the training and still get a functioning model. The potential scale is far bigger than any single datacenter.
RL is still training, just like pretraining is still training, and SFT is also training. That's how I look at it: model weights are being updated in all cases.
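A minimal sketch of that point, assuming a PyTorch-style API (the toy model, loss choices, and rewards here are hypothetical stand-ins, not anyone's actual training code): each regime computes a different loss, but the weight update at the end is the same.

    import torch

    model = torch.nn.Linear(8, 8)          # stand-in for an LLM
    opt = torch.optim.SGD(model.parameters(), lr=1e-3)

    def update(loss):
        opt.zero_grad()
        loss.backward()
        opt.step()                          # weights change here, in every regime

    x = torch.randn(4, 8)

    # Pretraining / SFT: supervised loss against target outputs.
    target = torch.randn(4, 8)
    update(torch.nn.functional.mse_loss(model(x), target))

    # RL (REINFORCE-style): loss = -reward * log-prob of the sampled action.
    logits = model(x)
    dist = torch.distributions.Categorical(logits=logits)
    action = dist.sample()
    reward = torch.ones(4)                  # stand-in scalar reward
    update(-(reward * dist.log_prob(action)).mean())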