top | item 37851277

(no title)

Thank you! I excluded the coding tasks as most annotators don't possess this expertise. I trust them in comparing pairs of dissimilar model outputs that don't require any specific skill but commonsense reasoning.

The only manual analysis was when I checked the passed/failed prompts of the top-performing model.

discuss

No comments yet.