
lappa | 1 year ago

Look at it from an algorithmic perspective. In computer science, many algorithms take a non-constant number of steps to execute. In a transformer model, however, there is a fixed number of decoder blocks, each with a fixed number of FFN layers. This puts a theoretical upper bound on the complexity of the algorithms a decoder network can execute in a single token-generation pass.

This explains why GPT-4 cannot accurately perform large-number multiplication or decimal exponentiation. [0]
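
To make the mismatch concrete, here's a toy sketch (the layer and sublayer counts are made-up stand-ins, not GPT-4's real architecture): per-token sequential compute is a constant fixed by depth, while schoolbook multiplication needs a number of digit operations that grows with operand length.

    # Toy sketch; the block/sublayer counts are invented stand-ins.
    def decoder_steps_per_token(n_blocks=96, sublayers_per_block=2):
        # one attention sublayer + one FFN sublayer per decoder block
        return n_blocks * sublayers_per_block

    def schoolbook_mult_steps(n_digits):
        # multiplying two n-digit numbers: ~n^2 digit multiplications
        return n_digits * n_digits

    for n in (2, 8, 32, 128):
        print(n, schoolbook_mult_steps(n), decoder_steps_per_token())
    # once n is large enough, the required steps exceed the fixed per-pass budget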

This extends to natural language generation in general. While some answers can be retrieved or generated immediately by a "cache" / algorithm that exists in latent space, other tokens come out better when their latent-space algorithm is run over multiple steps, i.e., spread across multiple generated tokens.
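
As a loose analogy (mine, and simplified): an algorithm with a data-dependent number of iterations can still be run by a fixed-depth machine if each generated token carries one iteration of state forward.

    # Loose analogy, not a claim about any real model: run one loop
    # iteration per "token", writing intermediate state into the output.
    def gcd_one_step(a, b):
        # a bounded amount of work, like one forward pass
        return (b, a % b) if b else (a, b)

    state = (1071, 462)
    transcript = [state]              # like emitting intermediate tokens
    while state[1]:
        state = gcd_one_step(*state)
        transcript.append(state)
    print(transcript)                 # each entry cost one more "pass"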

[0] https://www.semanticscholar.org/reader/817e52b815560f95171d8...


visarga | 1 year ago

> Quiet-STaR: Language Models Can Teach Themselves to Think Before Speaking

This paper suggests that a large language model should "think ahead" by predicting not only the next token but also a "supporting thought." The thought tokens are generated in parallel, so a single forward pass produces both the next token and a supporting thought of, say, 16 tokens.

This supporting thought influences the model's prediction. The process is then extended to multiple supporting thoughts by ingeniously masking cross-attention between thoughts to ensure their independence. So in essence we can fill all the remaining context with supporting thoughts and benefit from all of them in the same single forward pass.
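
Here's a minimal sketch of what such a mask could look like (my reconstruction from that description, not the paper's code): each thought attends to the shared prefix and causally to itself, but never to the other thoughts.

    import numpy as np

    # Sketch of an "independent thoughts" attention mask; True = may attend.
    def thought_mask(prefix_len, n_thoughts, thought_len):
        total = prefix_len + n_thoughts * thought_len
        causal = np.tril(np.ones((total, total), dtype=bool))
        mask = np.zeros((total, total), dtype=bool)
        mask[:, :prefix_len] = True          # everyone sees the shared prefix
        for t in range(n_thoughts):          # each thought sees only itself
            s = prefix_len + t * thought_len
            mask[s:s + thought_len, s:s + thought_len] = True
        return mask & causal                 # keep everything causal

    print(thought_mask(prefix_len=4, n_thoughts=2, thought_len=3).astype(int))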

The supporting thoughts themselves are trained with RL, with the objective of maximizing the probability of a longer sequence ahead. So they are optimized for the longer term, rather than for the myopic next-token prediction task.
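
In code terms, the training signal could look roughly like this (a simplified REINFORCE sketch with placeholder numbers; the paper's actual estimator has more moving parts, e.g. a baseline):

    import torch

    # Simplified REINFORCE sketch with placeholder values (not the paper's code).
    logp_thought = torch.randn(16, requires_grad=True)  # log-probs of the 16 thought tokens
    logp_future_with = torch.tensor(-12.3)    # log p(next k true tokens | context, thought)
    logp_future_without = torch.tensor(-15.9) # log p(next k true tokens | context)

    reward = logp_future_with - logp_future_without     # > 0 if the thought helped
    loss = -reward.detach() * logp_thought.sum()        # reinforce helpful thoughts
    loss.backward()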

https://arxiv.org/abs/2403.09629