valine | 9 months ago
It’s a misconception that transformers reason in token space. Tokens don’t attend to other tokens; high-dimensional latents attend to other high-dimensional latents. The final layer of a decoder-only transformer has full access to the entire latent space of all previous positions, the same latents you can project into a distribution over next tokens.
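To make the point concrete, here is a toy NumPy sketch (not any particular model's actual code; sizes and weights are made up) of that final projection: the latent is a continuous vector, and only multiplying it by the unembedding matrix and normalizing turns it into a next-token distribution.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, vocab_size = 16, 50  # toy sizes; real models are ~4096-dim, ~50k+ vocab
h_final = rng.standard_normal(d_model)                  # final-layer latent at the last position
W_unembed = rng.standard_normal((d_model, vocab_size))  # unembedding / "lm_head" matrix

# The latent lives in continuous space; this projection + softmax is the
# only step that collapses it into a distribution over discrete tokens.
logits = h_final @ W_unembed
probs = np.exp(logits - logits.max())
probs /= probs.sum()

assert probs.shape == (vocab_size,)
assert abs(probs.sum() - 1.0) < 1e-9
```

Everything upstream of that last matmul operates on the latents themselves, which is the sense in which the "reasoning" isn't happening in token space.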
woadwarrior01 | 9 months ago
That's essentially the core idea in Coconut[1][2], to keep the reasoning traces in a continuous space.
[1]: https://arxiv.org/abs/2412.06769
[2]: https://github.com/facebookresearch/coconut
refulgentis | 9 months ago
There's a lot of stuff that makes this hard to discuss, e.g. by "projectable to semantic tokens" you mean "able to be written down"... right?
Something I do to make an idea really stretch its legs is reword it in Fat Tony, the Taleb character.
Setting that aside, why do we think this path finding can't be written down?
Is Claude/Gemini Plays Pokemon just an iterated A* search?
valine | 9 months ago
But yeah, the LLM doesn’t even know the sampler exists. I used the last layer as an example, but it’s likely that reasoning traces exist in the latent space of every layer, not just the final one, with the most complex reasoning concentrated in the middle layers.
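A small sketch of the "the model doesn't know the sampler exists" point (assumed typical temperature/top-k sampling, not any specific implementation): sampling operates on the output logits entirely outside the forward pass, and the network only sees the chosen token when it comes back as the next input.

```python
import numpy as np

rng = np.random.default_rng(0)
logits = rng.standard_normal(50)  # stand-in for the model's next-token logits

def sample(logits, temperature=1.0, top_k=10):
    # This all happens after the forward pass; the transformer's weights
    # and activations are untouched by the choice made here.
    z = logits / temperature
    top = np.argsort(z)[-top_k:]          # indices of the k highest logits
    p = np.exp(z[top] - z[top].max())     # softmax over the kept tokens
    p /= p.sum()
    return int(rng.choice(top, p=p))

tok = sample(logits)
assert 0 <= tok < 50
```

Swapping the sampler (greedy, nucleus, beam) changes which token feeds back in, but never changes how any individual forward pass computes its latents.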