shawntan | 4 months ago
With the same data augmentation / 'test time training' setting, the vanilla Transformers do pretty well, close to the "breakthrough" HRM reported. From a brief skim, this paper is using similar settings to compare itself on ARC-AGI.
I, too, want to believe in smaller models with excellent reasoning performance. But first understand what ARC-AGI tests for, what the general setting is -- the one commercial LLMs use to compare against each other -- and what the specialised setting is that HRM and this paper use for evaluation.
The naming of that benchmark lends itself to hype, as we've seen in both HRM and this paper.
tsoj | 4 months ago
I think ARC-AGI was supposed to be a challenge for any model, the assumption being that you'd need the reasoning abilities of large language models to solve it. It turns out that this assumption is somewhat wrong. Do you mean that HRM and TRM are specifically trained on a small dataset of ARC-AGI samples, while LLMs are not? Or which difference exactly do you hint at?
shawntan | 4 months ago
Yes, precisely this. The question is really: what is ARC-AGI evaluating for?
1. If the goal is to see whether models can generalise to the ARC-AGI evals, then models being evaluated on it should not be trained on the tasks. Especially IF the ARC-AGI evaluations are constructed to be OOD from the ARC-AGI training data -- I don't know if they are. Further, in the HRM case there seems to be usage of the few-shot examples in the evals to construct more training data. TRM may do this through other means.
2. If the goal is that these evaluations should still be difficult even _having seen_ the training examples, and even after creating more training examples (having peeked at the test set), then the ablations show that you can get pretty far without universal/recurrent Transformers.
If 1, then I think the ARC-prize organisers should have better rules laid out for the challenge. From the blog post, I do wonder how far people will push the boundary (how much can I look at the test data to 'augment' my training data?) before the organisers say "This is explicitly not allowed for this challenge."
If 2, the organisers of the challenge should have evaluated how much of a challenge it would actually be when allowing extreme 'data augmentation', and maybe realised it wasn't that much of a challenge to begin with.
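To make the 'extreme data augmentation' question concrete: ARC grids are small integer matrices, so each demonstration pair can be multiplied many times over by applying symmetries and color permutations to both input and output. This is a minimal, hypothetical sketch of that style of augmentation (function names are illustrative, not the actual HRM/TRM pipeline):

```python
import random

# Hypothetical sketch of ARC-style grid augmentation: apply the 8 dihedral
# symmetries of the square, combined with color permutations, to both the
# input and output grids of a single demonstration pair.

def rotate90(grid):
    """Rotate a grid (list of lists) 90 degrees clockwise."""
    return [list(row) for row in zip(*grid[::-1])]

def dihedral_transforms(grid):
    """Yield the 8 symmetries of the square: 4 rotations x optional flip."""
    g = grid
    for _ in range(4):
        yield g
        yield [row[::-1] for row in g]  # horizontal flip of this rotation
        g = rotate90(g)

def recolor(grid, mapping):
    """Apply a color permutation (dict old -> new) cellwise."""
    return [[mapping[c] for c in row] for row in grid]

def augment_pair(inp, out, n_color_perms=2, seed=0):
    """Generate augmented (input, output) pairs from one demonstration."""
    rng = random.Random(seed)
    colors = sorted({c for row in inp + out for c in row})
    pairs = []
    for _ in range(n_color_perms):
        perm = colors[:]
        rng.shuffle(perm)
        mapping = dict(zip(colors, perm))
        # Apply the SAME geometric transform to input and output,
        # so the underlying task rule is preserved.
        for gi, go in zip(dihedral_transforms(inp), dihedral_transforms(out)):
            pairs.append((recolor(gi, mapping), recolor(go, mapping)))
    return pairs

pairs = augment_pair([[0, 1], [2, 0]], [[1, 0], [0, 2]])
print(len(pairs))  # 2 color permutations x 8 symmetries = 16
```

The point of the sketch is just the multiplier: a handful of demonstrations balloons into hundreds of training pairs, which is why where the demonstrations come from (train vs. eval few-shot examples) matters so much for what the benchmark measures.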
Given the outcome of both HRM and this paper, I tend to agree that the ARC-AGI folks do seem to allow this setting, _and_ that the task isn't as "AGI-complete" as it sets out to be.
ACCount37 | 4 months ago
Which is still a fun idea to play around with - this approach clearly has its strengths. But it doesn't appear to be an actual "better Transformer". I don't think it deserves nearly as much hype as it gets.
shawntan | 4 months ago
On recurrence: the idea has been around for a while: https://arxiv.org/abs/1807.03819 (Universal Transformers). There are reasons it hasn't really been picked up at scale, and the method tends to do well mainly on synthetic tasks.
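For readers unfamiliar with the idea: the core of a universal/recurrent Transformer is weight tying across depth -- one layer's parameters applied repeatedly, so depth becomes a runtime knob rather than a parameter-count knob. A toy sketch (a real model would use full attention/FFN blocks; everything here is illustrative):

```python
import numpy as np

# Minimal sketch of the weight-tying idea behind Universal Transformers:
# instead of N distinct layers, ONE shared layer is applied recurrently.

rng = np.random.default_rng(0)
d = 4
W = rng.standard_normal((d, d)) * 0.1  # the single shared weight matrix

def shared_layer(x):
    # Same parameters W at every step -- this is the weight tying.
    return np.tanh(x @ W) + x  # toy transformation plus residual connection

def recurrent_forward(x, steps):
    # More steps = more effective 'layers', with zero new parameters.
    for _ in range(steps):
        x = shared_layer(x)
    return x

x = rng.standard_normal((1, d))
y3 = recurrent_forward(x, steps=3)
y6 = recurrent_forward(x, steps=6)
print(y3.shape, y6.shape)
```

That decoupling of compute from parameter count is exactly what makes the approach attractive on small synthetic tasks (you can "think longer" at test time), and also part of why it is awkward at scale, where the recurrent steps serialize what stacked layers would otherwise pipeline.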