top | item 37361203


dingocat | 2 years ago

I have multiple questions regarding the methods of this test.

The biggest one is that the test doesn't aim to measure what GPT-4 can do and how well it does it, only whether the participant can guess the (possibly cherry-picked) answer the author decided on. In short, we don't know whether he sampled several answers and picked the most probable one (akin to consensus voting/self-consistency[1]), or asked the question once and took the first response.

Maybe GPT-4 guesses the correct answer to a question 80% of the time, but he got unlucky? You don't know; the author doesn't tell you. The answers were generated ahead of time and are the same every time you go through the test.

[1] https://doi.org/10.48550/arXiv.2203.11171
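For concreteness, the self-consistency idea in [1] boils down to sampling several completions and majority-voting over them. A minimal sketch, with a toy random "model" standing in for GPT-4 (the real sampler is hypothetical here, since we don't have the author's setup):

```python
import random
from collections import Counter

def sample_answer(question: str, rng: random.Random) -> str:
    # Toy stand-in for one non-deterministic GPT-4 completion:
    # pretend the model gets this question right 80% of the time.
    return rng.choices(["right answer", "wrong answer"], weights=[0.8, 0.2])[0]

def self_consistent_answer(question: str, n_samples: int = 101,
                           seed: int = 0) -> str:
    # Self-consistency: draw many samples, return the majority answer.
    rng = random.Random(seed)
    votes = Counter(sample_answer(question, rng) for _ in range(n_samples))
    return votes.most_common(1)[0][0]

print(self_consistent_answer("some quiz question"))
```

The point of the complaint: if the author showed one raw sample, you're grading a coin flip; if he showed the majority vote, you're grading something closer to the model's 80% mode. The quiz doesn't say which.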


PaulDavisThe1st|2 years ago

> the [ ... ] answer the author decided on

The questions mostly have correct or incorrect answers, and where there is some leeway, the author provides a fairly detailed explanation of what they would consider correct in each case. Do you have some specific criticism of an answer that you believe the author gets wrong?

thomasahle|2 years ago

> only whether the participant can guess the (possibly cherry-picked) answer the author decided on

My understanding is that the quiz samples a new GPT-4 answer every time you use it. That's why you put a confidence rather than a 0%/100% answer. There's always a chance it'll fail by freak accident.
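To be concrete about why a confidence makes sense: grading against a proper scoring rule rewards calibration rather than lucky yes/no guesses. The quiz's actual scoring formula isn't quoted here, but the Brier score is one standard choice and shows the trade-off:

```python
def brier_score(confidence: float, outcome: bool) -> float:
    # Squared error between your stated probability that GPT-4
    # succeeds and the actual 0/1 outcome. Lower is better;
    # a constant 50% guess always scores 0.25.
    return (confidence - (1.0 if outcome else 0.0)) ** 2

# Claiming 90% when GPT-4 succeeds costs little (~0.01), but the same
# claim when it fails by freak accident costs ~0.81.
print(round(brier_score(0.9, True), 3), round(brier_score(0.9, False), 3))
```

Under a rule like this, the rational move is to report your true belief, which is exactly why a "freak accident" failure only hurts you in proportion to how overconfident you were.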

Sophira|2 years ago

If you're basing this on the animation used when revealing the answer, that's a fake effect. The source code[0] shows that it's just a typewriter effect that plays once you submit your answer to the question.

Also, the commentary on the answers refers to specific parts of those answers. For it to be as in-depth as it is, it would have to be either pre-written or itself generated on the fly by GPT. (And of course the latter wouldn't make sense given the nature of the quiz.)

[0] https://nicholas.carlini.com/writing/llm-forecast/static/que...