shawntan | 4 months ago

The question I keep coming back to is whether ARC-AGI is intended to evaluate generalisation to the task at hand. If it is, the test data must have a meaningful distribution shift from the training data, so that only a model capable of that generalisation can do well.

This all goes out the window if the model being evaluated can _see_ the type of distribution shift it will encounter at test time. It's also unclear whether the shift is the same in the hidden set.

The gap between the large models' performance and the smaller models' raises questions about the evaluation, especially given the ablation studies. Are the large models trained on the same data as these tiny models? Should they be? If they shouldn't, then why are these small models allowed to have it in their training data?
