Rudybega | 18 days ago
As far as I can tell, pretty much every frontier model gets 100% on AIME: https://llm-stats.com/benchmarks/aime-2025
RC_ITR | 17 days ago
https://matharena.ai/?view=problem&comp=aime--aime_2026
As for MMLU, is your assertion that these AI labs are not correcting for errors in these exams and then self-reporting scores less than 100%?
As implied by the video, wouldn't it then take 1 intern a week max to fix those errors and allow any AI lab to become the first to consistently 100% the MMLU? I can guarantee Moonshot, DeepSeek, or Alibaba would be all over the opportunity to do just that if it were a real problem.