item 41156936

idontknowmuch | 1 year ago

Afaik, they aren't really trained independently -- for most models (e.g., DINO), the smaller sizes are actually distilled from the larger ones. It's much easier to get performant models at small sizes via distillation.
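For context, the usual distillation objective is just a KL divergence between temperature-softened teacher and student outputs. A minimal numpy sketch (function names and the temperature value are illustrative, not from any specific codebase):

```python
import numpy as np

def softmax(z, T=1.0):
    # temperature-scaled softmax; higher T softens the distribution
    z = np.asarray(z, dtype=float) / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, T=2.0):
    # KL(teacher || student) on softened distributions, scaled by T^2
    # (the standard Hinton-style knowledge-distillation term)
    p = softmax(teacher_logits, T)
    q = softmax(student_logits, T)
    return float(np.sum(p * (np.log(p) - np.log(q))) * T * T)
```

The student is trained to minimize this (usually mixed with the normal cross-entropy loss), which is why a small distilled model tends to outperform the same architecture trained from scratch.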

And I'd be curious about the utility of a model that scales up and down at inference time -- if that were the case, you'd still need storage equal to the maximum model size. That would essentially be useless for embedded applications, etc., unless you applied heavy quantization -- but quantization in an already-small parameter space would probably make the smaller models useless. I could see a benefit here in terms of optimizing latency for different applications, but maybe you have other ideas.
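To make the storage point concrete: in a slimmable/Matryoshka-style setup, the "small model" is just a prefix slice of the full weights, so you pay for the full matrix on disk and in memory no matter which width you run. A rough numpy sketch (the layer size and `forward` helper are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((512, 512)).astype(np.float32)  # full-size layer, always stored

def forward(x, width):
    # inference-time down-scaling: use only the first `width` output units
    return x @ W[:, :width]

x = rng.standard_normal((1, 512)).astype(np.float32)
small = forward(x, 128)   # "small model" path
full = forward(x, 512)    # full-capacity path

# the small path's outputs are just a prefix of the full path's outputs...
assert np.allclose(small, full[:, :128])
# ...but the memory footprint never shrinks below the full matrix:
assert W.nbytes == 512 * 512 * 4
```

So you save compute (and latency) at the narrow widths, but not storage -- which is exactly the problem for embedded targets.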

Given all that, I think training for a smaller number of parameters, as noted in OP, would kind of beat out a model that scales at inference time -- especially since most people know what kind of application they're aiming to build and the required level of performance.
