tsoj | 4 months ago
I think ARC-AGI was supposed to be a challenge for any model, the assumption being that you'd need the reasoning abilities of large language models to solve it. It turns out that this assumption is somewhat wrong. Do you mean that HRM and TRM are specifically trained on a small dataset of ARC-AGI samples, while LLMs are not? Or which difference exactly are you hinting at?
shawntan|4 months ago
Yes, precisely this. The question is really what is ARC-AGI evaluating for?
1. If the goal is to see whether models can generalise to the ARC-AGI evals, then models being evaluated on it should not be trained on the tasks, especially if the ARC-AGI evaluations are constructed to be OOD from the ARC-AGI training data. I don't know if they are. Further, in the HRM case the few-shot examples in the evals seem to be used to construct more training data. TRM may do the same to its training data through other means.
2. If the goal is that even _having seen_ the training examples, and creating more training examples (after having peeked at the test set), these evaluations should still be difficult, then the ablations show that you can get pretty far without universal/recurrent Transformers.
If 1, then I think the ARC-prize organisers should have better rules laid out for the challenge. From the blog post, I do wonder how far people will push the boundary (how much can I look at the test data to 'augment' my training data?) before the organisers say "This is explicitly not allowed for this challenge."
If 2, the organisers of the challenge should have evaluated how much of a challenge it would actually be once extreme 'data augmentation' is allowed, and maybe realised it wasn't that much of a challenge to begin with.
Given the outcome of both HRM and this paper, I tend to agree that the ARC-AGI folks do seem to allow this setting, _and_ that the task isn't as "AGI complete" as it sets out to be.
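To make the 'data augmentation' in question concrete, here is a minimal sketch of how a single few-shot demonstration pair could be expanded into several training pairs. It assumes ARC-style grids stored as lists of integer rows and uses only rotations and reflections. The helper names (`rotate90`, `dihedral_variants`, `augment_pair`) are hypothetical, and this is not claimed to be the pipeline HRM or TRM actually use.

```python
from typing import List, Tuple

Grid = List[List[int]]
Pair = Tuple[Grid, Grid]


def rotate90(grid: Grid) -> Grid:
    """Rotate a grid 90 degrees clockwise."""
    return [list(row) for row in zip(*grid[::-1])]


def flip_horizontal(grid: Grid) -> Grid:
    """Mirror a grid left to right."""
    return [row[::-1] for row in grid]


def dihedral_variants(grid: Grid) -> List[Grid]:
    """Return 8 rotations/reflections of a grid (duplicates are kept)."""
    variants = []
    current = grid
    for _ in range(4):
        variants.append(current)
        variants.append(flip_horizontal(current))
        current = rotate90(current)
    return variants


def augment_pair(pair: Pair) -> List[Pair]:
    """Apply the same transform to input and output so the pairing stays aligned."""
    inp, out = pair
    return list(zip(dihedral_variants(inp), dihedral_variants(out)))


# Hypothetical demonstration pair (assumption: not taken from any real ARC task).
demo: Pair = ([[1, 2], [3, 4]], [[4, 3], [2, 1]])
augmented = augment_pair(demo)
```

If the demonstration pairs fed into something like this come from the evaluation tasks, every augmented pair carries information about those tasks into training, which is the concern raised in point 1.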
shawntan|4 months ago
Just check out the original UT paper, or some of its follow-ups: Neural Data Router, https://arxiv.org/abs/2110.07732; Sparse Universal Transformers (SUT), https://arxiv.org/abs/2310.07096. There is even theoretical justification for why: https://arxiv.org/abs/2503.03961
The challenge is actually scaling them up to be useful as LLMs as well (I describe why it's a challenge in the SUT paper).
Given the way ARC-AGI is allowed to be evaluated, it's hard to say whether this is actually what is at play. My gut tells me, given the type of data that's been allowed in the training set, that some leakage of the evaluation has happened in both HRM and TRM.
But because as a field we've given up on carefully ensuring that the test set doesn't contaminate the training set, we just decide it's fine and the effect is minimal. Especially for LLMs, a test set example leaking into the dataset is treated as merely a drop in the bucket (I don't believe we should be dismissing it this way, but that's a whole 'nother conversation).
With these models that are challenge-targeted, it becomes a much larger proportion of what influences the model behaviour, especially if the open evaluation sets are there for everyone to look at and simply generate more. Now we don't know if we're generalising or memorising.
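As a rough illustration of what checking for this could look like, here is a minimal sketch that measures what fraction of evaluation grids also appear in the training grids, counting rotated or mirrored copies as the same grid. The function names (`canonical_key`, `leaked_fraction`) and the toy grids are assumptions made for the example. It only catches exact copies up to symmetry, so it gives a crude lower bound on leakage and says nothing about colour permutations or other augmentations.

```python
from typing import Iterable, List, Tuple

Grid = List[List[int]]


def rotate90(grid: Grid) -> Grid:
    """Rotate a grid 90 degrees clockwise."""
    return [list(row) for row in zip(*grid[::-1])]


def canonical_key(grid: Grid) -> Tuple[Tuple[int, ...], ...]:
    """Smallest tuple form among 8 rotations/reflections, shared by symmetric copies."""
    variants = []
    current = grid
    for _ in range(4):
        variants.append(current)
        variants.append([row[::-1] for row in current])
        current = rotate90(current)
    return min(tuple(tuple(row) for row in v) for v in variants)


def leaked_fraction(train_grids: Iterable[Grid], eval_grids: Iterable[Grid]) -> float:
    """Fraction of eval grids whose canonical form also occurs in the training grids."""
    train_keys = {canonical_key(g) for g in train_grids}
    eval_list = list(eval_grids)
    if not eval_list:
        return 0.0
    hits = sum(canonical_key(g) in train_keys for g in eval_list)
    return hits / len(eval_list)


# Toy data (assumption): the first eval grid is a rotated copy of the train grid.
train = [[[1, 2], [3, 4]]]
evals = [[[3, 1], [4, 2]], [[5, 5], [5, 5]]]
fraction = leaked_fraction(train, evals)
```

A non-zero value here would show overlap, but a zero would not show that the model is generalising rather than memorising.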
ACCount37|4 months ago
They're adversarial benchmarks - they intentionally hit the weak point of existing LLMs. Not "AGI complete" by any means. But not useless either.