top | item 44902103


tmule | 6 months ago

Unfortunately, as things stand, it’s well-known that behaviors and optimizations in small-scale models fail to replicate in larger models.


yorwba | 6 months ago

Doing hyperparameter sweeps on lots of small models to find the optimal values for each size, then fitting scaling laws to predict the hyperparameters to use for larger models, seems to work reasonably well. I think https://arxiv.org/abs/2505.01618 is the latest advance in that vein.
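A minimal sketch of that idea (not the method in the linked paper): the model sizes and "best learning rate" values below are made up for illustration, and the power-law form lr ≈ a·N^b is just one common choice of fit.

```python
import numpy as np

# Hypothetical sweep results: for each small model size N (parameters),
# the best learning rate found by a hyperparameter sweep at that size.
sizes = np.array([1e6, 3e6, 1e7, 3e7, 1e8])
best_lrs = np.array([1.2e-2, 7.9e-3, 5.0e-3, 3.3e-3, 2.1e-3])

# Fit a power law lr = a * N^b via linear regression in log-log space.
b, log_a = np.polyfit(np.log(sizes), np.log(best_lrs), 1)
a = np.exp(log_a)

# Extrapolate the predicted learning rate to a much larger model.
target_size = 1e10
predicted_lr = a * target_size ** b
print(f"fit: lr ~ {a:.3g} * N^{b:.3f}")
print(f"predicted lr at N=1e10: {predicted_lr:.2e}")
```

The bet, of course, is that the trend fitted on small models keeps holding two orders of magnitude beyond the largest point in the sweep.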

victorbjorklund | 6 months ago

Which in itself is very interesting and requires study.

anvuong | 6 months ago

It mostly has to do with sparsity in high-dimensional space. When you scale things to the extreme, everything is very far away from everything else, the space is sparse, and random vectors have a very high chance of being nearly orthogonal. All of this makes optimization incredibly slow and difficult. Just another facet of the so-called "curse of dimensionality".
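The near-orthogonality claim is easy to check numerically: for random Gaussian vectors, the typical |cosine similarity| between a pair shrinks like 1/√d as the dimension d grows. A quick sketch:

```python
import numpy as np

rng = np.random.default_rng(0)

def mean_abs_cosine(dim, n_pairs=1000):
    """Mean |cosine similarity| over random pairs of Gaussian vectors in R^dim."""
    u = rng.standard_normal((n_pairs, dim))
    v = rng.standard_normal((n_pairs, dim))
    cos = (u * v).sum(axis=1) / (
        np.linalg.norm(u, axis=1) * np.linalg.norm(v, axis=1)
    )
    return np.abs(cos).mean()

# As dimension grows, random pairs concentrate around orthogonality (cos -> 0).
for dim in (3, 100, 10_000):
    print(f"dim={dim:>6}: mean |cos| = {mean_abs_cosine(dim):.4f}")
```

In a few dimensions the average |cos| is sizable, while at ten thousand dimensions it is close to zero, which is the concentration effect the comment is pointing at.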

jebarker | 6 months ago

Well-known but not well-understood

jph00 | 6 months ago

That's not universally true. E.g., the GPT-4 tech report pointed out that nearly all their experiments were done on models 1000x smaller than the final model.

tmule | 6 months ago

Fair point, though I’d argue there’s an inherent selection bias here for improvements that happen to fit a scaling-law curve in the small-model regime.

indoordin0saur | 6 months ago

But why? If we don't know why, then how do we figure it out?