tylerneylon | 1 year ago

I couldn't figure out whether this project is based on an academic paper, i.e. on some published technique for determining LLM uncertainty.

This recent work is highly relevant: https://learnandburn.ai/p/how-to-tell-if-an-llm-is-just-gues...

It uses an idea called semantic entropy, which is more sophisticated than the standard entropy of the token logits and is a more appropriate statistical quantification of when an LLM is guessing versus highly certain. The original paper is in Nature, by authors from Oxford.
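For readers who want the gist of semantic entropy: sample several answers, group them into meaning-equivalent clusters, and take the entropy over clusters rather than over raw token sequences. A minimal sketch (the actual method uses a bidirectional-entailment model for the equivalence check; the exact-match stand-in below is just for illustration):

```python
import math


def semantic_entropy(samples, equiv):
    """Entropy over semantic clusters of sampled answers.

    samples: list of generated answer strings (equal-weight samples).
    equiv: callable (a, b) -> bool judging semantic equivalence; in the
    paper this is a bidirectional-entailment check, here it is a
    hypothetical stand-in.
    """
    clusters = []  # each cluster is a list of equivalent samples
    for s in samples:
        for c in clusters:
            if equiv(s, c[0]):
                c.append(s)
                break
        else:
            clusters.append([s])
    n = len(samples)
    probs = [len(c) / n for c in clusters]
    return -sum(p * math.log(p) for p in probs)


# Toy equivalence: case-insensitive exact match (stand-in for entailment).
eq = lambda a, b: a.strip().lower() == b.strip().lower()

# All samples agree -> one cluster -> entropy near zero (confident).
low = semantic_entropy(["Paris", "paris", "Paris"], eq)
# Three distinct answers -> three clusters -> high entropy (guessing).
high = semantic_entropy(["Paris", "Lyon", "Nice"], eq)
```

The key design point is that "Paris" and "paris" (or any paraphrases, with a real entailment model) count as one outcome, so fluent rephrasings of the same answer no longer inflate the uncertainty estimate the way they do for sequence-level entropy.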

vark90 | 1 year ago

The idea behind semantic entropy (estimating the entropy of a distribution over semantic units instead of over individual sequences in the output space) is great, but it's somewhat naive in that it treats these semantic units as well-defined partitions of the output space. There is a further generalization of this approach [1] which performs soft clustering of sampled outputs based on a similar notion of semantic equivalence between them.
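To illustrate the soft-clustering direction: instead of assigning each sample to exactly one hard cluster, you can work directly from a pairwise similarity matrix over the sampled answers. Below is a toy degree-style score in that spirit (not the exact estimator from [1]; in practice `sim` would be an NLI/entailment model, and the 0/1 match function is a hypothetical stand-in):

```python
def similarity_uncertainty(samples, sim):
    """Average pairwise dissimilarity among sampled answers.

    sim(a, b) should return a semantic similarity in [0, 1]; higher
    average similarity across all pairs means lower uncertainty.
    """
    n = len(samples)
    total = sum(sim(a, b) for a in samples for b in samples)
    return 1.0 - total / (n * n)


# 0/1 exact-match similarity, for illustration only.
match = lambda a, b: 1.0 if a.strip().lower() == b.strip().lower() else 0.0

# All samples agree -> score 0 (fully confident under this toy metric).
low = similarity_uncertainty(["Paris", "paris", "PARIS"], match)
# Mutually dissimilar samples -> score close to 1 (guessing).
high = similarity_uncertainty(["Paris", "Lyon", "Nice"], match)
```

Because `sim` can return fractional values, partially overlapping answers contribute partial agreement, which is exactly what hard partitions cannot express.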

But even with this in mind, there are caveats. We have recently published [2] a comprehensive benchmark of SOTA approaches to estimating uncertainty of LLMs, and found that while these semantic-aware methods do perform very well in many cases, on other tasks simple baselines, like the average entropy of the token distributions, perform on par with or better than complex techniques.
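For reference, the simple baseline mentioned here is cheap to compute from probabilities the model already produces. A sketch of average per-token entropy, assuming you can obtain a per-step probability distribution (e.g. a renormalized top-k) for each generated token:

```python
import math


def mean_token_entropy(dists):
    """Average entropy of the model's per-step next-token distributions.

    dists: one probability distribution (list of floats summing to 1)
    per generated token, e.g. a renormalized top-k from the logits.
    """
    entropies = [
        -sum(p * math.log(p) for p in dist if p > 0)  # Shannon entropy, nats
        for dist in dists
    ]
    return sum(entropies) / len(entropies)


# Peaked distributions at every step -> low average entropy (confident).
confident = [[0.97, 0.01, 0.01, 0.01]] * 3
# Near-uniform distributions -> high average entropy (guessing).
guessing = [[0.25, 0.25, 0.25, 0.25]] * 3
```

Unlike the semantic methods, this needs no extra sampling or entailment model, which is part of why it remains a competitive baseline.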

We have also developed an open-source Python library [3] (still in early development) that offers implementations of all modern uncertainty estimation (UE) techniques applicable to LLMs. It allows easy benchmarking of UE methods, as well as estimating output uncertainty for deployed models in production.

[1] https://arxiv.org/abs/2307.01379

[2] https://arxiv.org/abs/2406.15627

[3] https://github.com/IINemo/lm-polygraph

mikkom | 1 year ago

This is based on work done by this anonymous Twitter account:

https://x.com/_xjdr

I have been following this quite closely, and it has been very interesting: it seems smaller models can be more efficient with this sampler. The posts are worth going through if you're interested in this. I have a feeling that this kind of sampling is a big deal.

weitendorf | 1 year ago

I don't believe it is, because I'd hope that academics would better understand the distinction between token-level uncertainty and semantic uncertainty/correctness (or at least try to establish a data-backed correlation between the two before making claims about their relation). As I noted in my other comment, I believe the author is operating under a fundamental misunderstanding, which, per their note at the top, is probably why they haven't been able to yield practical results.

I don't say that to be a hater or to discourage them; they may well be on to something, and it's good for unique approaches like this to be tried. But I'm also not surprised there aren't academic papers about this approach: if it had no positive effects, for the reasons I mention, it probably wouldn't get published.

trq_ | 1 year ago

It's not an academic paper as far as I know, which is why I wanted to write this up. But the project certainly has a cult following (and cult opposition) on ML Twitter.

tylerneylon | 1 year ago

PS: My comment above is aimed at HN readers who are curious about LLM uncertainty. To the authors of the post/repo: looks cool! I'd be interested to see some tests of how well it works in practice at identifying uncertainty.