notavalleyman | 11 months ago
You'll see that AI companies, including OpenAI, are generally not competing on accuracy benchmarks.
For example, here are the benchmarks on which OpenAI seems to be trying to compete:
MMLU: Measuring Massive Multitask Language Understanding,
MATH: Measuring Mathematical Problem Solving With the MATH Dataset,
GPQA: A Graduate-Level Google-Proof Q&A Benchmark,
DROP: A Reading Comprehension Benchmark Requiring Discrete Reasoning Over Paragraphs,
MGSM: Multilingual Grade School Math Benchmark (from Language Models Are Multilingual Chain-of-Thought Reasoners),
HumanEval: Evaluating Large Language Models Trained on Code.
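For concreteness, the headline number on most of these leaderboards (MMLU, GPQA, MGSM) is just plain accuracy over exact-match answers. A minimal sketch, with hypothetical predictions and gold labels rather than real benchmark data:

```python
# Minimal sketch of how a multiple-choice benchmark like MMLU is scored:
# plain accuracy, i.e. the fraction of exact matches against gold answers.
# The example data below is hypothetical, not taken from any real benchmark.

def accuracy(predictions, answers):
    """Fraction of predictions that exactly match the gold answers."""
    assert len(predictions) == len(answers)
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)

# Hypothetical model picks vs. gold labels for four A-D questions.
preds = ["B", "C", "A", "D"]
gold = ["B", "C", "D", "D"]
print(accuracy(preds, gold))  # 0.75
```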
ziddoap | 11 months ago
First line of the abstract of MMLU: "We propose a new test to measure a text model's __multitask accuracy__."
Fourth line of the abstract of MATH: "To facilitate future research and __increase accuracy__ on MATH"
Second line of GPQA abstract: "We ensure that the questions are high-quality and extremely difficult: experts who have or are pursuing PhDs in the corresponding domains reach __65% accuracy__ [...] while highly skilled non-expert validators only reach __34% accuracy__"
Fifth line of the DROP abstract: "We apply state-of-the-art methods from both the reading comprehension and semantic parsing literature on this dataset and show that the best systems only achieve 32.7% F1 on __our generalized accuracy metric__"
From the MGSM paper: "MGSM __accuracy__ with different model scales."
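DROP's "generalized accuracy metric" quoted above is a token-overlap F1 rather than strict equality, so partial answers get partial credit. A hedged sketch of that style of scoring (simplified; the paper's actual evaluation script handles numbers and answer spans more carefully):

```python
# Simplified sketch of a DROP-style token-level F1 score: answers are
# compared by token overlap, not exact string equality. This is an
# illustration of the idea, not the paper's official evaluation code.
from collections import Counter

def token_f1(prediction: str, gold: str) -> float:
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    # Multiset intersection counts shared tokens with multiplicity.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

# A partially correct answer earns partial credit instead of zero.
print(round(token_f1("four touchdowns", "four"), 3))  # 0.667
```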
Models are designed to output accurate information in a reasonable amount of time. That's the whole goal. A math-specific model wants to provide accurate math answers; a general model wants to provide accurate answers to general questions. That's the entire point.
notavalleyman | 11 months ago