top | item 44074572


valine | 9 months ago

I think it’s helpful to remember that language models are not producing tokens, they are producing a distribution over possible next tokens. Just because your sampler picks a sequence of tokens that contains incorrect reasoning doesn’t mean a useful reasoning trace isn’t also contained within the latent space.

It’s a misconception that transformers reason in token space. Tokens don’t attend to other tokens. High-dimensional latents attend to other high-dimensional latents. The final layer of a decoder-only transformer has full access to the entire latent space of all previous latents, the same latents you can project into a distribution over next tokens.
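A toy numpy sketch of that last step, just to make the distinction concrete. The sizes and the name `W_unembed` are made up for illustration; the point is that the model's output is the full distribution, and the sampler's choice of one token is a separate step outside the model:

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, vocab = 8, 5                     # toy sizes, purely illustrative
h_final = rng.normal(size=d_model)        # final-layer latent at the last position
W_unembed = rng.normal(size=(d_model, vocab))  # stand-in unembedding matrix

logits = h_final @ W_unembed              # project latent -> vocab logits
probs = np.exp(logits - logits.max())
probs /= probs.sum()                      # softmax: distribution over next tokens

# The model "produces" probs, the whole distribution.
# Picking one token from it is the sampler's job, not the model's.
sampled = rng.choice(vocab, p=probs)
```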


x_flynn|9 months ago

What the model is doing in latent space is auxiliary to anthropomorphic interpretations of the tokens, though. And if the latent reasoning matched a ground-truth procedure (A*), then we'd expect it to be projectable to semantic tokens, but it isn't. So it seems the model has learned an alternative method for solving these problems.

valine|9 months ago

You’re thinking about this like the final layer of the model is all that exists. It’s highly likely reasoning is happening at a lower layer, in a different latent space that can’t natively be projected into logits.

refulgentis|9 months ago

It is worth pointing out that "latent space" is meaningless.

There's a lot of stuff that makes this hard to discuss, ex. "projectable to semantic tokens" you mean "able to be written down"...right?

Something I do to make an idea really stretch its legs is reword it in Fat Tony, the Taleb character.

Setting that aside, why do we think this path finding can't be written down?

Is Claude/Gemini Plays Pokemon just an iterated A* search?

aiiizzz|9 months ago

Is that really true? For example, Anthropic said that the model can make decisions about all the tokens before a single token is produced.

valine|9 months ago

That’s true, yeah. The model can do that because calculating latents is independent of next-token prediction. You do a forward pass for each token in your sequence without the final projection to logits.
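A minimal sketch of that separation, with a stand-in for the transformer stack (causal averaging here, not real attention; all names and sizes are invented). Latents get computed for every position; projecting to logits is a separate step you only need at the position you're sampling from:

```python
import numpy as np

rng = np.random.default_rng(1)
d_model, vocab, seq_len = 8, 5, 4         # toy sizes

W_unembed = rng.normal(size=(d_model, vocab))  # stand-in unembedding

def forward(latents):
    """Stand-in for the transformer stack: each position mixes in all
    earlier positions (a causal average here, purely illustrative)."""
    out = np.zeros_like(latents)
    for t in range(len(latents)):
        out[t] = latents[: t + 1].mean(axis=0)
    return out

x = rng.normal(size=(seq_len, d_model))   # embedded input tokens
h = forward(x)                            # latents for every position

# Projection to logits is optional and separate; sampling only ever
# needs it at the final position.
logits_last = h[-1] @ W_unembed
```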

jacob019|9 months ago

So you're saying that the reasoning trace represents sequential connections between the full distribution rather than the sampled tokens from that distribution?

valine|9 months ago

The lower-dimensional logits are discarded; the original high-dimensional latents are not.

But yeah, the LLM doesn’t even know the sampler exists. I used the last layer as an example, but it’s likely that reasoning traces exist in the latent space of every layer, not just the final one, with the most complex reasoning concentrated in the middle layers.
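One way to probe that claim is the "logit lens" style trick: project each layer's latent through the unembedding and see what distribution it would imply. A toy sketch (the blocks here are random linear maps, not a real model; everything is illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
d_model, vocab, n_layers = 8, 5, 3        # toy sizes
W_unembed = rng.normal(size=(d_model, vocab))  # stand-in unembedding

def layer(h, i):
    # stand-in for one transformer block (a fixed random map per layer)
    W = np.random.default_rng(10 + i).normal(size=(d_model, d_model))
    return np.tanh(h @ W)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

h = rng.normal(size=d_model)              # latent at one position
per_layer_probs = []
for i in range(n_layers):
    h = layer(h, i)
    # "logit lens": project this intermediate latent through the
    # unembedding to get the distribution it would imply
    per_layer_probs.append(softmax(h @ W_unembed))
```

In real models this projection is only natively meaningful at the final layer; for lower layers it's an interpretability probe, which is consistent with the point that lower-layer latents can't natively be projected into logits.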