As LLM benchmarks go, this is not a bad take at all.
One interesting point about this approach is that it is self-balancing, so when more powerful models come out, there is no need to change it.
Author here - yes, I'm regularly adding new models to this and other TrueSkill-based benchmarks and it works well. One thing to keep in mind is the need to run multiple passes of TrueSkill with randomly ordered games, because both TrueSkill and Elo are designed to be order-sensitive, as people's skills change over time.
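To make the multiple-pass idea concrete, here is a rough sketch of what I mean. It uses a plain hand-rolled Elo update instead of TrueSkill so it stays dependency-free, and the function names, K-factor, pass count, and game list are all made up for illustration, not taken from the actual benchmark code. The point is just that each pass sees the games in a different random order and the final rating is the average over passes.

```python
import random
from collections import defaultdict


def elo_pass(games, k=16.0, base=1500.0):
    """Run one sequential Elo pass over (winner, loser) pairs, in the given order."""
    ratings = defaultdict(lambda: base)
    for winner, loser in games:
        # Standard Elo expected score for the winner.
        expected = 1.0 / (1.0 + 10 ** ((ratings[loser] - ratings[winner]) / 400.0))
        delta = k * (1.0 - expected)
        ratings[winner] += delta
        ratings[loser] -= delta
    return dict(ratings)


def averaged_ratings(games, passes=100, seed=0):
    """Average ratings over several passes, each with a fresh random game order."""
    rng = random.Random(seed)
    totals = defaultdict(float)
    for _ in range(passes):
        shuffled = list(games)
        rng.shuffle(shuffled)
        for player, rating in elo_pass(shuffled).items():
            totals[player] += rating
    return {player: total / passes for player, total in totals.items()}


# Hypothetical results: (winner, loser) pairs between three placeholder models.
games = (
    [("model_a", "model_b")] * 6 + [("model_b", "model_a")] * 2
    + [("model_b", "model_c")] * 5 + [("model_c", "model_b")] * 3
    + [("model_a", "model_c")] * 7 + [("model_c", "model_a")] * 1
)

ratings = averaged_ratings(games)
for player, rating in sorted(ratings.items(), key=lambda kv: -kv[1]):
    print(f"{player}: {rating:.1f}")
```

A single pass gives a rating that depends on where each game happened to fall in the sequence; averaging over shuffled passes removes that dependence, which is what you want when the models themselves are not changing between games.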
zone411|10 months ago