top | item 41542207

badsandwitch | 1 year ago

Due to the limitations of gradient descent and training data, we are limited in the architectures that are viable. All of the top LLMs are decoder-only for efficiency reasons, and all models train on the production of text, because we are not able to train on the thoughts behind the text.

Something that often gives me pause is the possibility that an architecture exists which has a good chance of being capable of AGI (RNNs, transformers, etc. viewed as dynamical systems), but the model weights that would realize it cannot be found, because gradient descent fails or is not even viable on that architecture.
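To make the failure mode concrete, here is a toy sketch (all names hypothetical, not from the comment above): a loss surface where a perfect solution exists, but the loss is flat almost everywhere, so gradient descent receives no signal pointing toward it and never moves.

```python
# Toy "needle in a haystack" loss: zero loss only in a narrow band
# around w = 5.0, constant loss 1.0 everywhere else. A solution
# exists, but the gradient is zero away from the needle.

def loss(w):
    return 0.0 if abs(w - 5.0) < 0.1 else 1.0

def finite_diff_grad(f, w, eps=1e-4):
    # Central finite difference stands in for autograd here.
    return (f(w + eps) - f(w - eps)) / (2 * eps)

w = 0.0    # start far from the needle
lr = 0.1
for _ in range(1000):
    w -= lr * finite_diff_grad(loss, w)

# w is still 0.0: on a flat surface, every gradient step is zero,
# even though w = 5.0 would achieve zero loss.
print(w, loss(w))
```

This is only the crudest version of the problem; in practice the concern is subtler landscapes where viable weights exist but the training signal never leads there.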

No comments yet.