jpdus | 2 years ago
"I benchmarked on SAT reading, which is a nice human reference for reasoning ability. Took 3 sections (67 questions) from an official 2008-2009 test (2400 scale) and got the following results, here a SAT-like test:
- GPT3.5 - 690 (10 wrong)
- GPT4 - 770 (3 wrong)
- GPT4-turbo (one section at time) - 740 (5 wrong)
- GPT4-turbo (3 sections at once, 9K tokens) - 730 (6 wrong)"
Source: https://twitter.com/wangzjeff/status/1721934560919994823?t=P...
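For a rough sense of scale, the wrong-answer counts quoted above can be converted to raw accuracy. This is only a sketch: it assumes all 67 questions count equally, and the scaled SAT scores do not map linearly onto raw accuracy. The names `TOTAL_QUESTIONS`, `wrong_answers`, and `accuracy` are illustrative, not from the original post.

```python
# Raw accuracy from the quoted wrong-answer counts.
# Assumption: every one of the 67 questions carries equal weight.
TOTAL_QUESTIONS = 67

wrong_answers = {
    "GPT3.5": 10,
    "GPT4": 3,
    "GPT4-turbo (one section at a time)": 5,
    "GPT4-turbo (3 sections at once, 9K tokens)": 6,
}


def accuracy(wrong: int, total: int = TOTAL_QUESTIONS) -> float:
    """Fraction of questions answered correctly."""
    return (total - wrong) / total


for model, wrong in wrong_answers.items():
    correct = TOTAL_QUESTIONS - wrong
    print(f"{model}: {correct}/{TOTAL_QUESTIONS} correct ({accuracy(wrong):.1%})")
```

With only 67 questions, a gap of one or two wrong answers between the turbo runs is small enough that it may well be noise.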
dazzaji|2 years ago
rafaelero|2 years ago
exo-pla-net|2 years ago
Terretta|2 years ago
You seem to be suggesting it got a bit worse. The aider article seems to suggest GPT-4 got a bit worse (although much faster at being a bit worse), while GPT-3.5 got worse, then better, and also faster.
reitzensteinm|2 years ago
However, in my opinion the first-attempt score is more important, and Turbo does genuinely seem to lead there. There's still a possibility that the updated training data has tainted the results.