dnnssl2 | 2 years ago
Also how does that interact with MoE models? Do you have a mini version of the MoE, with smaller experts?
chillee | 2 years ago
Anecdotally, folks often seem to use say, 70B base + 7B as verifier. But I think there's a lot of room for experimentation and improvement here.
You could, say, take a 70B model and chop off the last 90% of its layers, then fine-tune what's left as the draft. Or perhaps you could use a model that's trained to generate 8 tokens at once. Or perhaps you could just use a statistical "n-gram" predictor.
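That last n-gram idea can be sketched in a few lines. This is a hypothetical illustration (names like `ngram_draft` and `speculative_step` are made up, not from any library): the draft just copies the continuation of the most recent earlier occurrence of the context's final n-gram, and the verifier keeps the longest prefix it agrees with, plus one token of its own so decoding always advances.

```python
def ngram_draft(context, n=3, k=4):
    """Propose up to k tokens by finding the most recent earlier
    occurrence of the final n-gram of `context` and copying the
    tokens that followed it (prompt-lookup-style drafting)."""
    if len(context) < n:
        return []
    tail = context[-n:]
    # Scan earlier positions, most recent first.
    for start in range(len(context) - n - 1, -1, -1):
        if context[start:start + n] == tail:
            return list(context[start + n:start + n + k])
    return []

def speculative_step(context, verifier_next, n=3, k=4):
    """One speculative decode step: draft k tokens, accept the prefix
    the verifier agrees with, then append one verifier token so we
    make progress even when the draft is wrong. `verifier_next(seq)`
    returns the verifier model's greedy next token for `seq`."""
    draft = ngram_draft(context, n, k)
    accepted = []
    for tok in draft:
        if verifier_next(context + accepted) == tok:
            accepted.append(tok)
        else:
            break
    # The verifier supplies the next token either way.
    accepted.append(verifier_next(context + accepted))
    return accepted
```

On repetitive text this accepts several tokens per verifier "step", which is where the speedup comes from; on text with no repeats it degrades gracefully to one verifier token per step.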