(no title)
kleyd | 6 months ago
What's missing is a part with more plasticity that can work in parallel and bi-directionally interact with the current static models in real-time.
This would mean individually trained models based on their experience so that knowledge is not translated to context, but to weight adjustments.
movpasd|6 months ago
Disclaimer: These are my not-terribly-informed layperson's thoughts :^)
The attention mechanism does seem to give us a certain adaptability (especially in the context of research showing chain-of-thought "hidden reasoning") but I'm not sure that it's enough.
Thing is, earlier language models used recurrent units that would be able to store intermediate data, which would give more of a foothold for these kind of on-the-fly adjustments. And here is where the theory hits the brick wall of engineering. Transformers are not just a pure machine learning innovation, the key is that they are massively scalable, and my understand is part of this comes from the _lack_ of recurrence.
I guess this is where the interest in foundation models comes from. If you could take a codebase as a whole and turn it into effective training data to adjust the weights of an existing, more broadly-trained model, But is this possible with a single codebase's worth of data?
Here again we see the power of human intelligence at work: the ability to quite consciously develop new mental models even given very little data. I imagine this is made possible by leaning on very general internal world-models that let us predict the outcomes of even quite complex unseen ("out-of-distribution") situations, and that gives us extra data. It's what we experience as the frustrations and difficulties of the learning process.