How is this benchmark not inherently biased towards GPT?
If I did the same sort of thing but used Claude to grade the tests, would I get similar results? Or would that be inherently biased towards Claude scoring high?
The benchmark should evaluate each prompt multiple times to see how much variance there is in the scores. Even GPT-4 grading GPT-4 should probably be run several times.
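For what it's worth, here is a rough Python sketch of the repeated-grading idea. `score_spread` and `fake_grader` are made-up names, and the fake grader is only a noisy stand-in for a real model call; none of this comes from the benchmark itself.

```python
import random
import statistics
from typing import Callable, Tuple


def score_spread(grade: Callable[[str], float], prompt: str, runs: int = 5) -> Tuple[float, float]:
    """Grade the same prompt `runs` times; return (mean, sample standard deviation)."""
    if runs < 2:
        raise ValueError("need at least 2 runs to estimate variance")
    scores = [grade(prompt) for _ in range(runs)]
    return statistics.mean(scores), statistics.stdev(scores)


def fake_grader(prompt: str) -> float:
    # Hypothetical stand-in for an LLM grader: a noisy score around 7 out of 10.
    return 7.0 + random.gauss(0, 0.5)


if __name__ == "__main__":
    mean, spread = score_spread(fake_grader, "example prompt", runs=10)
    print(f"mean={mean:.2f} stdev={spread:.2f}")
```

If the spread across runs is comparable to the gap between two models' scores, a single-run ranking probably isn't telling you much. Swapping a different grader model into `grade` would be one way to check for the self-preference question too.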
bradknowles|2 years ago
crashocaster|2 years ago
habitue|2 years ago
natsucks|2 years ago
aiunboxed|2 years ago
jasonjmcghee|2 years ago
londons_explore|2 years ago
Is this our Concorde moment?
ionwake|2 years ago