top | item 45220071

puilp0502 | 5 months ago

What kind of benefit does Multi-Token Prediction bring to the inference side? Is it only relevant in pretraining efficiency?


jychang | 5 months ago

Speculative decoding! It makes inference a LOT faster.

Instead of generating tokens strictly one at a time, the model also emits a draft of the token after the next one, and that draft is verified via speculative decoding (instead of the draft coming from a separate small model like Qwen 0.6B). If the draft passes the check, the 2nd token arrives MUCH faster than generating it normally.

If it's wrong, you fall back and generate that token the normal way (a lot slower than just checking it). Usually it's correct, so inference is a lot faster overall.
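The accept-or-fall-back loop described above can be sketched in a few lines. This is a toy illustration, not anyone's real implementation: `main_step` and `mtp_step` are hypothetical stand-ins for the main model's next-token head and the MTP draft head (in a real system both are outputs of one transformer forward pass, and the verification is batched, which is why a correct draft is nearly free).

```python
def main_step(tokens):
    """Toy 'main head': deterministic next token from the context."""
    return sum(tokens) % 10

def mtp_step(tokens):
    """Toy 'MTP head': drafts the token *after* the next one.
    Deliberately wrong sometimes, to exercise the fallback path."""
    nxt = main_step(tokens)
    draft = sum(tokens + [nxt]) % 10
    return draft if tokens[-1] % 3 != 0 else (draft + 1) % 10

def decode(tokens, n_new):
    tokens = list(tokens)
    accepted, rejected = 0, 0
    while n_new > 0:
        nxt = main_step(tokens)      # normal next-token step
        draft = mtp_step(tokens)     # cheap draft of the token after it
        tokens.append(nxt)
        n_new -= 1
        if n_new == 0:
            break
        # Verify the draft with one more main-head step; accept if it
        # matches, otherwise keep the main head's token (the slow path).
        check = main_step(tokens)
        if check == draft:
            tokens.append(draft)
            accepted += 1
        else:
            tokens.append(check)
            rejected += 1
        n_new -= 1
    return tokens, accepted, rejected
```

Note the output is always identical to plain one-at-a-time decoding; the draft only changes *when* a token becomes available, never *which* token is chosen.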

stingraycharles | 5 months ago

Because then the second token only needs to be checked, not generated, as it’s already generated? And it’s much faster to generate multiple tokens at the same time than one at a time? Is that the idea?

I’m not an expert on LLMs, just a user.

moffkalast | 5 months ago

Hmm but isn't the checking only required because the draft model is not the same model and can only speculate what the main one is thinking, hence the name? If the main model generates two tokens itself, then how can it be wrong about its own predictions?

cubefox | 5 months ago

> What kind of benefit does Multi-Token Prediction bring to the inference side? Is it only relevant in pretraining efficiency?

It is only useful for inference and doesn't help with pretraining. Which actually points to speculative decoding not being sufficiently general, as the same underlying property (some sequences of tokens are easy to predict) could be exploited for training as well. See here: https://goombalab.github.io/blog/2025/hnet-future/#d-footnot...

Zacharias030 | 5 months ago

There is no reason that it couldn’t be beneficial for training though.

rfoo | 5 months ago

It could be a better draft model for speculative decoding than a separately trained head like EAGLE.