The interesting delta here is that this proves we can distribute the training and still get a functioning model. The potential scale is far bigger than any single datacenter.
RL is still training, just like pretraining is still training, and SFT is also training. That's how I look at it: model weights are being updated in all cases.
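A minimal sketch of that point, assuming a PyTorch-style API (the toy model, loss choices, and rewards here are hypothetical stand-ins, not anyone's actual training code): each regime computes a different loss, but the weight update at the end is the same.

    import torch

    model = torch.nn.Linear(8, 8)          # stand-in for an LLM
    opt = torch.optim.SGD(model.parameters(), lr=1e-3)

    def update(loss):
        opt.zero_grad()
        loss.backward()
        opt.step()                          # weights change here, in every regime

    x = torch.randn(4, 8)

    # Pretraining / SFT: supervised loss against target outputs.
    target = torch.randn(4, 8)
    update(torch.nn.functional.mse_loss(model(x), target))

    # RL (REINFORCE-style): loss = -reward * log-prob of the sampled action.
    logits = model(x)
    dist = torch.distributions.Categorical(logits=logits)
    action = dist.sample()
    reward = torch.ones(4)                  # stand-in scalar reward
    update(-(reward * dist.log_prob(action)).mean())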