If you squint, it's a fixed-iteration ODE solver. I'd love to see a generalization of this, and of the Universal Transformer mentioned above, re-envisioned as flow-matching/optimal-transport models.
This makes me think it would be nice to see some kind of child of the modern transformer architecture and neural ODEs. There was such interesting work a few years ago on how neural ODEs/PDEs could be seen as a continuous limit of layer depth. Maybe models could learn cool stuff if the embeddings were somehow solutions of a learned dynamical system.
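The continuous-limit point is easy to see in a toy sketch (everything here is made up for illustration: `f` stands in for a learned layer). A residual stack is exactly a forward-Euler solve of dh/dt = f(h) with step size 1:

```python
import numpy as np

def f(h, W):
    # hypothetical learned vector field (a single tanh layer, for illustration)
    return np.tanh(W @ h)

def residual_net(h, W, depth):
    # classical residual stack: h_{k+1} = h_k + f(h_k)
    for _ in range(depth):
        h = h + f(h, W)
    return h

def neural_ode_euler(h, W, t1, steps):
    # forward-Euler solve of dh/dt = f(h); with dt = 1 this is exactly
    # the residual stack above, so layer depth plays the role of time
    dt = t1 / steps
    for _ in range(steps):
        h = h + dt * f(h, W)
    return h

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 4)) * 0.1
h0 = rng.normal(size=4)
print(np.allclose(residual_net(h0, W, 3), neural_ode_euler(h0, W, 3.0, 3)))  # prints True
```

Shrinking dt while increasing the step count is the "continuous limit of layer depth" that the neural ODE papers make precise.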
Does the training process ensure that all the intermediate steps remain interpretable, even on larger models? That is, that we don't end up with some alien gibberish in all but the final step?
Training doesn't encourage the intermediate steps to be interpretable, but they are still in the same token vocabulary space, so you could decode them. They'll probably just be wrong.
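Decoding an intermediate state is just a "logit lens"-style probe: project it through the output unembedding and take the argmax token. A minimal sketch, with a made-up vocabulary and random matrices standing in for the real model:

```python
import numpy as np

def decode_step(h, unembed, vocab):
    # project a hidden state through the output unembedding and take the
    # argmax token; nothing in training forces this to be sensible
    # before the final step, so early decodes may be "wrong"
    return vocab[int(np.argmax(unembed @ h))]

vocab = ["the", "cat", "sat", "mat"]                # toy vocabulary
rng = np.random.default_rng(1)
unembed = rng.normal(size=(4, 8))                   # toy unembedding matrix
h_states = [rng.normal(size=8) for _ in range(3)]   # fake intermediate states
print([decode_step(h, unembed, vocab) for h in h_states])
```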
lukebechtel|1 month ago
    output = layers(layers(layers(layers(input))))

instead of the classical:

    output = layer4(layer3(layer2(layer1(input))))
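The difference is just parameter sharing across depth. A toy sketch (residual tanh blocks standing in for real transformer layers):

```python
import numpy as np

def make_layer(rng, d):
    # toy residual "layer" with its own weights
    W = rng.normal(size=(d, d)) * 0.1
    return lambda h: h + np.tanh(W @ h)

rng = np.random.default_rng(0)
h = rng.normal(size=4)

# classical: four distinct parameter sets, each applied once
layer1, layer2, layer3, layer4 = (make_layer(rng, 4) for _ in range(4))
out_classic = layer4(layer3(layer2(layer1(h))))

# weight-tied: a single parameter set, iterated four times
layers = make_layer(rng, 4)
out_tied = h
for _ in range(4):
    out_tied = layers(out_tied)
```

Same compute, but the tied version has a quarter of the parameters and can in principle be iterated a variable number of times.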
oofbey|1 month ago
    output = layers(input)

Or

    output = layers(layers(input))

Depends on how difficult the token is.
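One way to sketch "iterate until the token is done": keep applying the shared block until the state stops moving. This crude fixed-point halting rule is my stand-in here; the Universal Transformer actually uses a learned halting unit (ACT). Toy example with a pure contraction in place of the model:

```python
import numpy as np

def adaptive_depth(h, layers, tol=1e-3, max_steps=16):
    # iterate the shared block until the state stops changing (a crude
    # fixed-point halting criterion, not the learned ACT halting unit)
    for step in range(1, max_steps + 1):
        h_next = layers(h)
        if np.linalg.norm(h_next - h) < tol:
            return h_next, step
        h = h_next
    return h, max_steps

layers = lambda h: 0.5 * h  # toy contractive block standing in for the model

_, steps_easy = adaptive_depth(np.ones(2), layers)
_, steps_hard = adaptive_depth(8.0 * np.ones(2), layers)
print(steps_easy, steps_hard)  # the "harder" (larger) input takes more steps
```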