jpdus | 2 years ago

For other (non-code) benchmarks, people are having the opposite experience:

"I benchmarked on SAT reading, which is a nice human reference for reasoning ability. Took 3 sections (67 questions) from an official 2008-2009 test (2400 scale) and got the following results, here a SAT-like test:

- GPT3.5: 690 (10 wrong)
- GPT4: 770 (3 wrong)
- GPT4-turbo (one section at a time): 740 (5 wrong)
- GPT4-turbo (3 sections at once, 9K tokens): 730 (6 wrong)"

Source: https://twitter.com/wangzjeff/status/1721934560919994823?t=P...

dazzaji|2 years ago

Does anybody know if the 2008-2009 SAT is in the training set for these models? Assuming so, I’d be especially interested in head-to-head evals on this type of non-code benchmark using problem sets not already in the training data, to see how the models perform on fresh material.

rafaelero|2 years ago

Probably not a statistically significant difference there.

Terretta|2 years ago

What did you mean by "opposite"?

You seem to be suggesting it got a bit worse, and the Aider article also seems to suggest GPT4 got a bit worse (though much faster at being a bit worse), while GPT3.5 got worse, then better, while also getting faster.

reitzensteinm|2 years ago

The Aider article has been updated with the complete results. Previously Turbo was leading slightly. So far any difference is in the noise.

However, in my opinion the first attempt score is more important, and Turbo does genuinely seem to lead there. There's still a possibility the updated training data has tainted the results.