(no title)
reexpressionist | 2 years ago
And I agree that the leaderboards don't currently reflect the quantities of interest typically needed in practice.
minimaxir | 2 years ago
That is very, very hard to do in an objective manner, as the current LLM benchmark gaming demonstrates.
Sure, you can deploy a smaller model to production to get real-world user data and feedback, but a) deploying a suboptimal model can give a bad first impression and b) the quality is still subjective and requires other metrics to be analyzed. Looking at prediction probabilities only really helps if you have a single correct output token, which isn't what LLM benchmarks test for.
reexpressionist | 2 years ago
Hopefully in 2024 we can get at least one of the benchmarks to move toward assessing non-parametric/distribution-free uncertainty for selective classification, reflecting recent CS/Stats advances that should already be standard practice. Working on it.
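For readers unfamiliar with the idea: one standard distribution-free approach is split conformal prediction, where a held-out calibration set determines a confidence threshold with a finite-sample coverage guarantee, and the model abstains below it. A minimal sketch (all names and the synthetic setup are illustrative, not tied to any particular benchmark or the commenter's own work):

```python
# Minimal sketch of split conformal prediction for selective
# classification. Assumes softmax scores from some already-trained
# classifier; the function names here are illustrative.
import numpy as np

def calibrate_threshold(cal_scores, cal_labels, alpha=0.1):
    """Compute a distribution-free nonconformity threshold.

    cal_scores: (n, k) softmax probabilities on held-out calibration data
    cal_labels: (n,) true class indices
    alpha: target error rate (0.1 -> ~90% coverage guarantee)
    """
    n = len(cal_labels)
    # Nonconformity score: 1 minus the probability of the true class
    nonconf = 1.0 - cal_scores[np.arange(n), cal_labels]
    # Conformal quantile with the finite-sample (n + 1) correction
    level = np.ceil((n + 1) * (1 - alpha)) / n
    return np.quantile(nonconf, level, method="higher")

def predict_or_abstain(scores, q):
    """Answer with the top class when confident enough, else abstain."""
    top_prob = scores.max(axis=-1)
    answers = scores.argmax(axis=-1)
    abstain = (1.0 - top_prob) > q
    return answers, abstain
```

A benchmark built on this would score models on accuracy among answered examples jointly with abstention rate, rather than raw accuracy alone.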