(no title)

I did this with Claude over the holidays. I put Claude in the role of guesser and compared its guesses to those of an experienced human player. It turns out the two matched each other.

suveen_ellawela|1 year ago

That's a nice experiment! I think Codenames could definitely be an evaluation method for LLMs.

pieix|1 year ago

Elo ratings across different card and board games would be a great eval metric now that the systems are general enough to play Codenames, chess, poker…

__MatrixMan__|1 year ago

It would be fun to build one, perhaps mediated by an app, where you have to guess whether your spymaster is a human or an AI based on the quality of their choices.