top | item 38478383

dnnssl2 | 2 years ago

How does one select a good candidate for the draft model in speculative decoding? I imagine there's some better intuition than just selecting the next parameter count down (e.g., 70B -> 13B, 13B -> 7B).

Also how does that interact with MoE models? Do you have a mini version of the MoE, with smaller experts?

chillee | 2 years ago

This is indeed a bit of a dark art. Essentially, you want a balance between "is significantly faster than base model" and "generates similar stuff to the base model".

Anecdotally, folks often seem to use, say, a 7B draft model with a 70B base model as the verifier. But I think there's a lot of room for experimentation and improvement here.
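A back-of-envelope sketch of why this balance matters (my own illustrative model, not something from the thread): if each draft token is accepted with probability `a` and the draft proposes `k` tokens per verifier pass, the expected tokens emitted per pass is roughly (1 - a^(k+1)) / (1 - a), so a well-matched draft can beat a faster but sloppier one.

```python
# Hypothetical cost model for speculative decoding (illustrative only).
# `a` = per-token acceptance probability, `k` = draft tokens per verifier
# call, `draft_cost` = cost of one draft forward pass relative to the
# verifier's (verifier pass = 1.0).

def expected_tokens_per_call(a: float, k: int) -> float:
    """Expected tokens emitted per verifier call, assuming each of the k
    draft tokens is accepted independently with probability a (plus one
    token always sampled from the verifier itself)."""
    if a == 1.0:
        return k + 1
    return (1 - a ** (k + 1)) / (1 - a)

def speedup(a: float, k: int, draft_cost: float) -> float:
    """Tokens per unit compute, relative to one verifier pass per token."""
    return expected_tokens_per_call(a, k) / (1 + k * draft_cost)

# A well-aligned draft (high acceptance, slightly slower) vs. a cheap
# but poorly matched one:
print(speedup(a=0.8, k=4, draft_cost=0.10))
print(speedup(a=0.4, k=4, draft_cost=0.05))
```

Under these toy numbers the higher-acceptance draft wins despite costing twice as much per token, which is the intuition behind preferring drafts that "generate similar stuff" over merely tiny ones.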

You could... say, take a 70B model and maybe just chop off the last 90% of layers and then fine-tune. Or perhaps you could use a model that's trained to generate 8 tokens at once. Or perhaps you could just use a statistical "n-gram" predictor.