lappa | 1 year ago
This explains why GPT-4 cannot accurately perform large-number multiplication or decimal exponentiation. [0]
The example extends to natural language generation in general. Some answers can be retrieved or generated immediately by a "cache" or algorithm that exists in latent space, but other tokens come out better when their latent-space algorithm is executed over multiple steps.
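To make the multi-step point concrete, here is a minimal sketch of schoolbook multiplication broken into partial products. It is only an analogy for "executing the algorithm in several steps" and says nothing about how GPT-4 works internally; the function name `long_multiply_steps` is made up for illustration.

```python
def long_multiply_steps(a: int, b: int) -> tuple[list[int], int]:
    """Multiply two non-negative integers via explicit partial products.

    Returns the list of partial products (one per digit of b, lowest
    digit first) and their sum, which equals a * b.
    """
    partials = []
    # Walk the digits of b from least to most significant.
    for place, digit_char in enumerate(reversed(str(b))):
        # Each step is a small, easy sub-problem: one digit times a, shifted.
        partials.append(a * int(digit_char) * 10 ** place)
    return partials, sum(partials)


if __name__ == "__main__":
    steps, total = long_multiply_steps(123, 45)
    print(steps, total)  # [615, 4920] 5535
```

Each intermediate value is cheap to compute, while producing the final product in a single step means doing all of that work at once.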
[0] https://www.semanticscholar.org/reader/817e52b815560f95171d8...
visarga | 1 year ago
This paper suggests that a large language model should "think ahead" by predicting not only the next token but also a "supporting thought." The thought tokens are generated in parallel, so a single forward pass produces both the next token and a supporting thought, which might be, for example, 16 tokens long.
This supporting thought influences the model's prediction. The process is then extended to multiple supporting thoughts by ingeniously masking attention between thoughts so that they stay independent of one another. In essence, we can fill all the remaining context with supporting thoughts and benefit from all of them in the same forward pass.
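A rough sketch of what such a mask could look like, under simplifying assumptions that are mine and not the paper's: a shared prefix of `prefix_len` tokens followed by `num_thoughts` thoughts of `thought_len` tokens each, all packed into one sequence. Every thought token may attend to the prefix and to earlier tokens of its own thought, but never to another thought. The function name and layout are hypothetical.

```python
def thought_attention_mask(prefix_len: int, num_thoughts: int, thought_len: int) -> list[list[bool]]:
    """Build a boolean mask where mask[i][j] is True if position i may attend to j.

    Assumed layout (illustrative only): [prefix | thought 0 | thought 1 | ...].
    """
    total = prefix_len + num_thoughts * thought_len
    mask = [[False] * total for _ in range(total)]

    # Prefix tokens use ordinary causal attention among themselves.
    for i in range(prefix_len):
        for j in range(i + 1):
            mask[i][j] = True

    for k in range(num_thoughts):
        start = prefix_len + k * thought_len
        for i in range(start, start + thought_len):
            # Every thought token sees the whole shared prefix...
            for j in range(prefix_len):
                mask[i][j] = True
            # ...and, causally, the tokens of its own thought only.
            for j in range(start, i + 1):
                mask[i][j] = True
    return mask


if __name__ == "__main__":
    for row in thought_attention_mask(prefix_len=2, num_thoughts=2, thought_len=2):
        print("".join("x" if allowed else "." for allowed in row))
```

With the arguments above, the last two rows print `xx..x.` and `xx..xx`: the second thought sees the prefix and itself, and the columns belonging to the first thought stay masked.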
The supporting thoughts themselves are trained with RL, with the objective of maximizing the probability of a longer stretch of upcoming tokens. So they are optimized for a longer horizon instead of the myopic next-token prediction task.
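One way to picture that training signal, assuming a REINFORCE-style setup with a mean baseline (my assumption here; see the paper for the actual objective): score each thought by the log-probability it gives to the true upcoming tokens, then reward thoughts that beat the average. The function `thought_advantages` and its inputs are hypothetical.

```python
def thought_advantages(future_logprobs: list[float]) -> list[float]:
    """Turn per-thought log-probabilities into baseline-subtracted rewards.

    future_logprobs[k] is assumed to be the log-probability of the true
    next few tokens when the model conditions on thought k.
    """
    # Baseline: the average score across all sampled thoughts.
    baseline = sum(future_logprobs) / len(future_logprobs)
    # Positive values mark thoughts that helped more than average.
    return [lp - baseline for lp in future_logprobs]


if __name__ == "__main__":
    print(thought_advantages([-2.0, -4.0]))  # [1.0, -1.0]
```

A policy-gradient update would then raise the likelihood of thoughts with positive advantage and lower it for the rest.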
https://arxiv.org/abs/2403.09629