Kind of ironic that "Poop" is the word that stands out the most. But having an AI judge it seems weird. To get a true benchmark, the judge must be a human who is susceptible to the 'irrational' cues (like 'Poop' or humor) that the original paper highlighted.
jacob_indie|7 days ago
What is interesting, though, is that there are different judges and how they compare to each other (a first look at the data shows they behave differently).
Also, it is interesting to see how well the AI opponents and judges pick up on personality and cues from round history. Some LLMs pick this up very well and counter humans; others are quite "dumb" and just submit random words.
The same goes for AI judges.
I do store the reasoning of opponents and judges in the background but am not displaying it for the moment; it might be interesting to add later, but it would distort the data ;)