top | item 38895631

reexpressionist | 2 years ago

The alternative approach is to start with a small[er] model but derive reliable uncertainty estimates, only moving to a larger model if necessary (i.e., if the estimated probability of the predictions is lower than the task requires).
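A minimal sketch of that escalation policy. The model functions, the toy confidence values, and the 0.9 threshold are all illustrative assumptions, not something stated in the comment:

```python
def small_model(x):
    # Stand-in for a cheap model: returns (prediction, confidence).
    return ("answer_a", 0.95) if len(x) < 20 else ("answer_a", 0.55)

def large_model(x):
    # Stand-in for an expensive model, invoked only when needed.
    return ("answer_b", 0.90)

def predict(x, threshold=0.9):
    # Escalate to the larger model only when the small model's
    # estimated probability is below what the task requires.
    pred, conf = small_model(x)
    if conf >= threshold:
        return pred, "small"
    pred, conf = large_model(x)
    return pred, "large"
```

In practice the confidence would come from a calibrated uncertainty estimate rather than a raw score, which is exactly the hard part being debated below.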

And I agree that the leaderboards don't currently reflect the quantities of interest typically needed in practice.

minimaxir | 2 years ago

> derive reliable uncertainty estimates

That is very, very hard to do in an objective manner, as the current LLM benchmark gaming demonstrates.

Sure, you can deploy a smaller model to production to get real-world user data and feedback, but a) deploying a suboptimal model can give a bad first impression and b) the quality is still subjective and requires other metrics to be analyzed. Looking at prediction probabilities only really helps if you have a single correct output token, which isn't what LLM benchmarks test for.

reexpressionist | 2 years ago

I believe we have two rather different settings in mind. My statement assumes the enterprise use-case, where having a verifier is required. (In this context, I'm also assuming the approach of constraining against the observed data.) In such a selective classification setting, the end-user need not be exposed to lower-quality outputs; instead, they receive a null prediction once the model cascade has been exhausted (i.e., after progressively moving to larger models until the probability is acceptable).
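A sketch of that cascade-with-abstention idea. The model list, fixed confidence values, and threshold here are hypothetical; a real deployment would use calibrated or distribution-free uncertainty estimates, not raw scores:

```python
def cascade_predict(x, models, threshold=0.9):
    # Try models from cheapest to most expensive; return a null
    # prediction (None) if no model meets the confidence threshold,
    # so the end-user never sees a low-confidence output.
    for model in models:
        pred, conf = model(x)
        if conf >= threshold:
            return pred
    return None  # selective classification: abstain

# Toy models with fixed confidences, for illustration only.
cheap  = lambda x: ("a", 0.60)
medium = lambda x: ("b", 0.80)
big    = lambda x: ("c", 0.95)
```

With the full cascade, `big` is the first model to clear the threshold; if the cascade stops at `medium`, the function abstains and returns `None`.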

Hopefully in 2024 we can get at least one of the benchmarks to move to assessing non-parametric/distribution-free uncertainty for selective classification, reflecting more recent CS/Stats advances that should be used in practice. Working on it.