thethirdone | 3 months ago
> Models look god-tier on paper:
> they pass exams
> solve benchmark coding tasks
> reach crazy scores on reasoning evals
Models don't look "god-tier" from benchmarks; surely an 80% is not godlike. I would really like more human comparisons on these benchmarks to get a sense of what an 80% actually means, though. I would not say that any model shows a "crazy" score on ARC-AGI.
I have broadly seen incremental improvements in benchmarks since 2020, mostly at a level I would judge to be below average human reasoning, but above average human knowledge. No one would call GPT-3 godlike, and it is quite similar to modern models on benchmarks; it is not a difference like 1% vs 90%. I think most people would consider GPT-3 to be closer to Opus 4.5 than Opus 4.5 is to a human.
majormajor | 3 months ago
Though I do not fully know where the boundary lies between "a model prompted to iterate and use tools" and "a model trained to be more iterative by design." How meaningful is that distinction?
But the people who don't get this are the less technical, less hands-on VPs, CEOs, etc., who are deciding on layoffs, upcoming headcount, and "replace our customer service or engineering staff with AI" initiatives. A lot of those moves are going to look either really silly or really genius depending on exactly how "AGI-like" the plateau turns out to be. That affects a lot of people's jobs and livelihoods, so it's good to see the hype machine slow down and get more realistic about the near-term future.
dwohnitmok | 3 months ago
Tooling vs model is a false dichotomy in this case. The massive improvements in tooling are directly traceable back to massive improvements in the models.
If you took the same tooling and scaffolding and stuck GPT-3 or even GPT-4 in it, they would fail miserably and from the outside the tooling would look abysmal, because all of the affordances of current tooling come directly from model capability.
All of the tooling approaches of modern systems were proposed and prototypes were made back in 2020 and 2021 with GPT-3. They just sucked because the models sucked.
The massive leap in tooling quality directly reflects a concomitant leap in model quality.
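To make the tooling-vs-model point concrete: the "scaffolding" in question is, at its core, just a loop that lets the model request tool calls and feeds the results back into its context. A minimal sketch (with a hypothetical stub standing in for the model, since the loop itself is trivial; the capability lives in the model's choices):

```python
# Minimal sketch of an agent "scaffold": a loop that parses tool
# requests from the model's reply, runs the tool, and appends the
# result to the transcript. `stub_model` is a hypothetical stand-in
# for an LLM call; the scaffold is only as good as the model's
# ability to pick sensible tool calls and know when to stop.

def stub_model(transcript: str) -> str:
    # Stand-in for the model: once it "sees" the tool result in its
    # context, it answers; until then, it requests a tool call.
    if "result: 4" in transcript:
        return "FINAL: 4"
    return "TOOL: add 2 2"

TOOLS = {"add": lambda a, b: int(a) + int(b)}

def run_agent(model, prompt: str, max_steps: int = 5):
    transcript = prompt
    for _ in range(max_steps):
        reply = model(transcript)
        if reply.startswith("FINAL:"):
            return reply.removeprefix("FINAL:").strip()
        _, name, *args = reply.split()
        result = TOOLS[name](*args)
        transcript += f"\n{reply}\nresult: {result}"
    return None  # model never produced a final answer

print(run_agent(stub_model, "What is 2 + 2?"))  # prints 4
```

Loops like this were prototyped on GPT-3 era models; the loop hasn't changed much, but a weak model inside it picks bad tool calls and never converges, which is why the scaffolding only started looking impressive as the models improved.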