top | item 44590684

(no title)

slybot | 7 months ago

> Fit an ELO-style rating system (Bradley-Terry) to turn pairwise comparisons into absolute per-document scores.

There are some conceptual gaps and this sentence is misleading in general. First, this sentence implies that Bradley-Terry is a some sort of an Elo variant, which is not true. Elo rating introduced nearly 10 years later in completely different domain.

They are two completely different ranking systems. Bradley-Terry use ratio-based, while Elo use logistic score function. Scales of the scores are completely different as well as the their sensitivity to the score differences.

Possibly, Bradley-Terry is preffered by the authors due to simpler likelihood evaluation and update doesn't depend on the order of pairwise evaluations.

There is also variants of Elo-rating that use MLE (optimized Elo) and even recently Bayesian Elo. For post-hoc time invariant scores, there is randomized Elo rating and so on.

People like Elo ratings because they are simple to understand. Most of the time, they forget why they developed specifically for chess tournaments. All variants above and dozens more try to improve (fix) one aspect of the Elo ratings, because their application has no 100% clear determination of winner, the update scale parameter is too small or large, matches are played simultaneously, different matches played and so on.

Also, let say one document is always preffered one all LLMs then it has only wins, then MLE will result in flat marginal likelihood for that where the update parameter (c) will inf.

discuss

No comments yet.