shawntan | 4 months ago
This would all go out the window if the model being evaluated can _see_ the type of distribution shift it will encounter at test time. It's also unclear whether the shift is the same in the hidden set.
How the large models perform against the smaller ones raises questions about the evaluations, especially given the ablation studies. Are the large models trained on the same data as these tiny models? Should they be? If they shouldn't, then why are we allowing the small models access to that data in their training?