top | item 46926047


data_maan | 22 days ago

As mathematically interesting as the 10 questions presented in the paper are, the paper is --sorry for the harsh language-- garbage from the point of view of benchmarking and ML research: just 10 questions, few descriptive statistics, no interesting findings beyond "can LLMs solve these uncontaminated questions", and no broad set of LLMs evaluated.

The field of AI4Math has so many benchmarks that are well executed -- based on the related work section, it seems the authors are not familiar with AI4Math at all.

My belief is that this paper is even being discussed solely because a Fields Medalist, Martin Hairer, is on it.


bawolff | 22 days ago

A paper that isn't about benchmarking or ML research is bad from the perspective of benchmarking. Not exactly a shocker.

The authors themselves literally state: "Unlike other proposed math research benchmarks (see Section 3), our question list should not be considered a benchmark in its current form"

data_maan | 22 days ago

On the website https://1stproof.org/#about they claim: "This project represents our preliminary efforts to develop an objective and realistic methodology for assessing the capabilities of AI systems to autonomously solve research-level math questions."

Sounds to me like a benchmark in all but name. And they failed pretty badly at achieving what they set out to do.