(no title)
px1999 | 1 year ago
> we use a stricter evaluation setting: a model is only considered to solve a question if it gets the answer right in four out of four attempts ("4/4 reliability"), not just one
This surely makes the other models post smaller numbers. I'd be curious how it stacks up if doing eg 1/1 attempt or 1/4 attempts.
No comments yet.