Doing hyperparameter sweeps on many small models to find the optimal values at each size, then fitting scaling laws to predict the hyperparameters to use for larger models, seems to work reasonably well. I think https://arxiv.org/abs/2505.01618 is the latest advance in that vein.
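A minimal sketch of that idea: fit a power law to (model size, best hyperparameter) pairs from small-model sweeps and extrapolate. The parameter counts and learning rates below are made up for illustration, not taken from the paper.

```python
import numpy as np

# Hypothetical sweep results: best learning rate found at each small model size.
sizes = np.array([1e7, 3e7, 1e8, 3e8])        # model parameter counts
best_lr = np.array([6e-3, 4e-3, 2.5e-3, 1.6e-3])

# Fit a power law lr = a * N^b via linear regression in log-log space.
b, log_a = np.polyfit(np.log(sizes), np.log(best_lr), 1)
a = np.exp(log_a)

# Extrapolate the fitted law to a larger model size.
predicted_lr = a * (1e9) ** b
print(f"lr(N) ~ {a:.3g} * N^{b:.3f}; predicted lr at 1B params: {predicted_lr:.2e}")
```

The real methods are more involved (sweeping several hyperparameters jointly, accounting for batch size and data budget), but the extrapolation step is essentially this.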
It mostly has to do with sparsity in high-dimensional space. When you scale things to the extreme, everything is very far away from everything else, the space is sparse, random vectors have a very high chance of being nearly orthogonal, and so on. All of this makes optimization incredibly slow and difficult. Just another facet of the so-called "curse of dimensionality".
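The near-orthogonality claim is easy to check empirically: the cosine similarity between independent random Gaussian vectors concentrates around zero as the dimension grows. A quick sketch:

```python
import numpy as np

rng = np.random.default_rng(0)

def mean_abs_cosine(dim, n_pairs=1000):
    """Average |cosine similarity| over random pairs of Gaussian vectors."""
    a = rng.standard_normal((n_pairs, dim))
    b = rng.standard_normal((n_pairs, dim))
    cos = np.sum(a * b, axis=1) / (
        np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    )
    return np.abs(cos).mean()

# Typical cosine similarity shrinks roughly like 1/sqrt(dim).
for dim in (2, 100, 10_000):
    print(dim, mean_abs_cosine(dim))
```

In 2D random directions are often strongly aligned; by 10,000 dimensions the average |cosine| is under 0.01, i.e. almost everything is almost orthogonal.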