top | item 42332458

(no title)

px1999 | 1 year ago

Imo the con is picking the metric that makes others look artificially bad when it doesn't seem to be all that different (at least on the surface)

> we use a stricter evaluation setting: a model is only considered to solve a question if it gets the answer right in four out of four attempts ("4/4 reliability"), not just one

This surely makes the other models post smaller numbers. I'd be curious how it stacks up if doing eg 1/1 attempt or 1/4 attempts.

discuss

order

No comments yet.