If you squint, it's a fixed-iteration ODE solver. I'd love to see a generalization of this, and of the Universal Transformer mentioned above, re-envisioned as flow-matching/optimal-transport models.
This makes me think it would be nice to see some kind of child of the modern transformer architecture and neural ODEs. There was such interesting work a few years ago on how neural ODEs/PDEs could be seen as a continuous limit of layer depth. Maybe models could learn cool stuff if the embeddings were somehow solutions of a learned dynamical system.
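The continuous-limit point is easy to see in a toy sketch (everything here is made up for illustration: `f` stands in for a learned layer). A residual stack is exactly a forward-Euler solve of dh/dt = f(h) with step size 1:

```python
import numpy as np

def f(h, W):
    # hypothetical learned vector field (a single tanh layer, for illustration)
    return np.tanh(W @ h)

def residual_net(h, W, depth):
    # classical residual stack: h_{k+1} = h_k + f(h_k)
    for _ in range(depth):
        h = h + f(h, W)
    return h

def neural_ode_euler(h, W, t1, steps):
    # forward-Euler solve of dh/dt = f(h); with dt = 1 this is exactly
    # the residual stack above, so layer depth plays the role of time
    dt = t1 / steps
    for _ in range(steps):
        h = h + dt * f(h, W)
    return h

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 4)) * 0.1
h0 = rng.normal(size=4)
print(np.allclose(residual_net(h0, W, 3), neural_ode_euler(h0, W, 3.0, 3)))  # prints True
```

Shrinking dt while increasing the step count is the "continuous limit of layer depth" that the neural ODE papers make precise.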
Does the training process ensure that all the intermediate steps remain interpretable, even on larger models? That is, that we don't end up with some alien gibberish in all but the final step?
Training doesn't encourage the intermediate steps to be interpretable, but they are still in the same token vocabulary space, so you could decode them. They'll probably just be wrong.
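Decoding an intermediate state is just a "logit lens"-style probe: project it through the output unembedding and take the argmax token. A minimal sketch, with a made-up vocabulary and random matrices standing in for the real model:

```python
import numpy as np

def decode_step(h, unembed, vocab):
    # project a hidden state through the output unembedding and take the
    # argmax token; nothing in training forces this to be sensible
    # before the final step, so early decodes may be "wrong"
    return vocab[int(np.argmax(unembed @ h))]

vocab = ["the", "cat", "sat", "mat"]                # toy vocabulary
rng = np.random.default_rng(1)
unembed = rng.normal(size=(4, 8))                   # toy unembedding matrix
h_states = [rng.normal(size=8) for _ in range(3)]   # fake intermediate states
print([decode_step(h, unembed, vocab) for h in h_states])
```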
lukebechtel|1 month ago
    output = layers(layers(layers(layers(input))))

instead of the classical:

    output = layer4(layer3(layer2(layer1(input))))
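The difference is just parameter sharing across depth. A toy sketch (residual tanh blocks standing in for real transformer layers):

```python
import numpy as np

def make_layer(rng, d):
    # toy residual "layer" with its own weights
    W = rng.normal(size=(d, d)) * 0.1
    return lambda h: h + np.tanh(W @ h)

rng = np.random.default_rng(0)
h = rng.normal(size=4)

# classical: four distinct parameter sets, each applied once
layer1, layer2, layer3, layer4 = (make_layer(rng, 4) for _ in range(4))
out_classic = layer4(layer3(layer2(layer1(h))))

# weight-tied: a single parameter set, iterated four times
layers = make_layer(rng, 4)
out_tied = h
for _ in range(4):
    out_tied = layers(out_tied)
```

Same compute, but the tied version has a quarter of the parameters and can in principle be iterated a variable number of times.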
oofbey|1 month ago
    output = layers(input)

Or

    output = layers(layers(input))

Depends on how difficult the token is.
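One way to sketch "iterate until the token is done": keep applying the shared block until the state stops moving. This crude fixed-point halting rule is my stand-in here; the Universal Transformer actually uses a learned halting unit (ACT). Toy example with a pure contraction in place of the model:

```python
import numpy as np

def adaptive_depth(h, layers, tol=1e-3, max_steps=16):
    # iterate the shared block until the state stops changing (a crude
    # fixed-point halting criterion, not the learned ACT halting unit)
    for step in range(1, max_steps + 1):
        h_next = layers(h)
        if np.linalg.norm(h_next - h) < tol:
            return h_next, step
        h = h_next
    return h, max_steps

layers = lambda h: 0.5 * h  # toy contractive block standing in for the model

_, steps_easy = adaptive_depth(np.ones(2), layers)
_, steps_hard = adaptive_depth(8.0 * np.ones(2), layers)
print(steps_easy, steps_hard)  # the "harder" (larger) input takes more steps
```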