I think the discussion around "exponentials" in scaling for top-end LLMs (think 3.5 Sonnet or GPT-4, not the smaller models) is really pointless. The main heuristic we have for predicting performance is scaling, which has worked pretty well so far. These benchmarks are imperfect in lots of ways, aren't necessarily sensitive enough to reveal exponential progress, and step changes in capability are difficult to predict in advance. If you zoom out on the first graphic from December 2023 back to 2020, the scores of models released at that time on these benchmarks would be much, much lower. The best lens for the future performance of large models is uncertainty.
emregucerr|1 year ago