charlescurt123 | 1 year ago
I'm honestly a bit envious of future engineers who will be tackling these kinds of problems with a 100-line Jupyter notebook on a laptop years from now. If we discovered the right method or algorithm for these long-horizon problems, a 2B-parameter model might even outperform current models on everything except short, extreme reasoning problems.
The only solution I've ever considered for this is expanding a model's dimensionality over time, rather than focusing on finding perfect weights. The higher the dimensionality you can give a model, the greater its theoretical storage capacity. This could resemble a two-layer model: one layer acting as a superposition of multiple ideal points, and the other layer knowing how to use them.
When you think about the loss landscape, imagine it with many minima for a given task. If we could create a method that navigates between these minima by reconfiguring the model when needed, we could in theory build a single model with a near-infinite number of local minima, and therefore a kind of higher-dimensional memory. This may sound wild, but consider that the human brain potentially creates and prunes thousands of connections in a single day. Could it be that these connections steer our internal loss landscape between the different minima we need throughout the day?
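A minimal toy sketch of that two-layer idea (illustrative only; the mixture-of-weight-banks construction, names, and shapes are my own assumptions, not an established architecture): one layer stores several candidate weight matrices, and a small gating layer mixes them per input, so the effective weights are reconfigured on the fly rather than fixed.

```python
import torch
import torch.nn as nn

class SuperposedLinear(nn.Module):
    """Toy layer whose effective weight matrix is an input-dependent
    mixture of K stored weight 'banks' (the superposition of ideal points),
    chosen by a small gating layer (the layer that 'knows how to use them')."""
    def __init__(self, in_dim, out_dim, num_banks=4):
        super().__init__()
        # K candidate weight matrices stored side by side.
        self.banks = nn.Parameter(torch.randn(num_banks, out_dim, in_dim) * 0.02)
        # Gating layer that picks the mixture for each input.
        self.gate = nn.Linear(in_dim, num_banks)

    def forward(self, x):
        # Mixture coefficients over the K banks, one set per example: (B, K).
        alpha = torch.softmax(self.gate(x), dim=-1)
        # Effective per-example weights: sum_k alpha_k * W_k -> (B, out, in).
        w_eff = torch.einsum("bk,koi->boi", alpha, self.banks)
        # Apply the reconfigured weights to the input: (B, out).
        return torch.einsum("boi,bi->bo", w_eff, x)

# Quick smoke test on random inputs.
layer = SuperposedLinear(in_dim=16, out_dim=8, num_banks=4)
print(layer(torch.randn(32, 16)).shape)  # torch.Size([32, 8])
```

The point of the sketch is only that the stored "points" and the mechanism that selects among them can live in separate parameters, so the model's effective configuration can move without retraining every weight.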
aDyslecticCrow | 1 year ago
Models that change size as needed have been experimented with, but they are either too inefficient or too difficult to optimize within a limited power budget. However, I agree that they are likely what is needed if we want to continue to scale upward in size.

I suspect the real bottleneck is a breakthrough in training itself. Backpropagation against a simple scalar loss is too simplistic to optimize our current models perfectly, let alone future, larger ones. But there is no guarantee that a better alternative exists, which may put a hard ceiling on current ML approaches.
charlescurt123 | 1 year ago
What I’m advocating is a substantial increase in this aspect: keeping model size the same while expanding its dimensionality. The "curse of dimensionality" illustrates how even a modest increase in dimensions yields an exponentially larger volume.
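As a rough back-of-the-envelope illustration of that volume point (the resolutions and dimensions below are arbitrary choices of mine): at a fixed resolution of 10 cells per axis, the number of distinct cells grows as 10^d, and almost all of a hypercube's volume ends up near its surface as d grows.

```python
# "Room" available at a fixed resolution of 10 cells per axis: 10**d cells.
for d in (2, 3, 10, 50, 100):
    print(f"d={d:4d}  cells={10.0 ** d:.3e}")

# A second view: the fraction of a unit hypercube's volume inside a slightly
# smaller cube (side 0.9) collapses as d grows, so in high dimensions almost
# all of the volume sits near the surface.
for d in (2, 10, 100, 1000):
    print(f"d={d:5d}  inner-cube fraction={0.9 ** d:.3e}")
```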
While I agree that backpropagation isn’t a complete solution, it’s ultimately just a stochastic search method. The key point here is that expanding the dimensionality of a model’s space is likely the only viable long-term direction, and to achieve it, backpropagation needs to keep working in an increasingly high-dimensional space.
A useful analogy is training a small model on random versus structured data. With structured data, we can learn an extensive amount, but with random data, we hit a hard limit imposed by the network. Why is that?
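A quick, rough way to see that limit empirically (a sketch, not a rigorous experiment; the network size, dataset sizes, and targets below are arbitrary choices of mine): train the same small MLP once on a smooth target and once on pure noise labels, and watch how well it can even fit its own training set as the dataset grows.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)

def train_mse(x, y):
    """Fit a small fixed-size MLP and report its error on the training set."""
    net = MLPRegressor(hidden_layer_sizes=(32,), max_iter=3000, random_state=0)
    net.fit(x, y)
    return float(np.mean((net.predict(x) - y) ** 2))

for n in (100, 1000, 5000):
    x = rng.uniform(-3, 3, size=(n, 1))
    structured = np.sin(2 * x).ravel()   # smooth, compressible target
    random_lbls = rng.normal(size=n)     # incompressible noise labels
    print(f"n={n:5d}  structured MSE={train_mse(x, structured):.4f}"
          f"  random MSE={train_mse(x, random_lbls):.4f}")
```

With the structured target the training error stays low at every size, while with random labels it climbs toward the variance of the noise once the dataset exceeds what the fixed set of weights can memorize; that gap is the hard limit I mean.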