RC_ITR|18 days ago
This pattern of considering 90% accuracy (the level we seem to have stalled out at on MMLU and AIME) to be 'solved' is really concerning to me.
AGI has to be 100% right 100% of the time to be AGI, and we aren't being tough enough on these systems in our evaluations. We're moving on to new and impressive tasks toward some imagined AGI goal without even trying to find out whether we can build true Artificial Niche Intelligence.
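A back-of-the-envelope sketch of why the gap between 90% and 100% matters so much (not from the thread; the question counts and accuracy figure are illustrative assumptions): if per-question accuracy is independent, the chance of getting an entire benchmark right decays exponentially with its length.

```python
def p_all_correct(p: float, n: int) -> float:
    """Probability of answering all n independent questions correctly,
    given per-question accuracy p."""
    return p ** n

# At 90% per-question accuracy, a 15-question AIME-style set:
print(f"{p_all_correct(0.90, 15):.3f}")    # ~0.206
# A 100-question set at the same per-question accuracy:
print(f"{p_all_correct(0.90, 100):.6f}")   # ~0.000027
```

So a model that looks "90% solved" per question is almost never "100% right 100% of the time" over a whole exam, which is the standard the comment is arguing for.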
Rudybega|18 days ago
As far as I can tell for AIME, pretty much every frontier model gets 100% https://llm-stats.com/benchmarks/aime-2025
RC_ITR|17 days ago
https://matharena.ai/?view=problem&comp=aime--aime_2026
As for MMLU, is your assertion that these AI labs are not correcting for errors in these exams and then self-reporting scores less than 100%?
As implied by the video, wouldn't it then take one intern a week, at most, to fix those errors and let any AI lab become the first to consistently score 100% on MMLU? I can guarantee Moonshot, DeepSeek, or Alibaba would be all over that opportunity if it were a real problem.
kingstnap|17 days ago
You don't need to take my word for it, try playing MMLU yourself.
https://d.erenrich.net/are-you-smarter-than-an-llm/index.htm...
It's not MMLU-Pro, by the way, which is considerably harder.