(no title)
deoxykev | 1 year ago
This creates a gap between the mechanical measurement of certainty and true understanding, much like mistaking the map for the territory or confusing the finger pointing at the moon with the moon itself.
I've done some work in this space before, trying to derive useful measures from the logprobs, such as Shannon entropy over a sliding window, or even bzip compression ratio as a proxy for information density. But I didn't find anything semantically useful or reliable to exploit.
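For anyone who wants to try those two measures, here is a minimal sketch. It assumes the API returns, for each generated position, a list of top-k logprobs in natural log; the function names and the window size are made up for illustration, and `bz2` stands in for "bzip".

```python
import bz2
import math


def token_entropy(top_logprobs):
    """Shannon entropy in bits for one position, from its top-k logprobs."""
    probs = [math.exp(lp) for lp in top_logprobs]
    total = sum(probs)  # renormalize, since top-k drops the tail of the distribution
    return -sum((p / total) * math.log2(p / total) for p in probs if p > 0)


def sliding_mean_entropy(per_token_top_logprobs, window=4):
    """Mean entropy over each window of consecutive token positions."""
    entropies = [token_entropy(t) for t in per_token_top_logprobs]
    return [
        sum(entropies[i:i + window]) / window
        for i in range(len(entropies) - window + 1)
    ]


def bz2_ratio(text):
    """Compressed size divided by raw size; lower means more redundant text."""
    raw = text.encode("utf-8")
    if not raw:
        raise ValueError("text must be non-empty")
    return len(bz2.compress(raw)) / len(raw)
```

Because the entropy is computed on a truncated top-k list, it is only a lower-resolution proxy for the entropy of the full next-token distribution.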
The best approach I found was just multiple-choice questions: "Does X entail Y? Please output [A] True or [B] False." Then measure the linprobs (linear probabilities) of the next token, which should be `[A` (say, 90%) or `[B` (10%). Then we might make a statement like: the LLM thinks there is a 90% probability that X entails Y.
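A rough sketch of that readout, assuming the API hands back a dict mapping candidate next tokens to natural-log probabilities. The dict shape, the function name, and the exact token strings `[A` / `[B` are assumptions; real tokenizers may split the bracket differently.

```python
import math


def option_probs(next_token_logprobs, options=("[A", "[B")):
    """Convert the logprobs of the option tokens to linear probabilities.

    Options missing from the returned top-k are reported as 0.0.
    """
    return {
        opt: math.exp(next_token_logprobs[opt]) if opt in next_token_logprobs else 0.0
        for opt in options
    }


# Hypothetical logprobs for the token right after the question.
example = {"[A": math.log(0.9), "[B": math.log(0.1)}
probs = option_probs(example)
print(f"P(X entails Y) ~ {probs['[A']:.2f}")
```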
activatedgeek|1 year ago
In our paper [1], we find that asking a follow-up question like "Is the answer correct?" and taking the normalized probability of the "Yes" or "No" token (or, more generally, any such token the model was trained to emit) seems to be the best bet so far for getting well-calibrated probabilities out of the model.
In general, the log-probability of tokens is not a good indicator of anything other than how well the model satisfies the pre-training loss of predicting the "next token" (it is likely very well calibrated on that task, though). The semantics of language are a much less tamable object, especially when we don't have a good way to estimate a normalizing constant, because every answer can be paraphrased in many ways and still be correct. The volume of correct answers in the generation space of a language model is just too small.
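To make the "normalized probability" part concrete, here is a minimal sketch, not the code from [1]. It assumes a dict of next-token logprobs for the token following the follow-up question; pooling surface variants such as " Yes" and "yes" is an extra assumption added here, and all names are illustrative.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) for a non-empty list."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def p_yes(next_token_logprobs,
          yes_tokens=("Yes", " Yes", "yes"),
          no_tokens=("No", " No", "no")):
    """P(Yes) renormalized over the Yes/No tokens only."""
    yes = [next_token_logprobs[t] for t in yes_tokens if t in next_token_logprobs]
    no = [next_token_logprobs[t] for t in no_tokens if t in next_token_logprobs]
    if not yes and not no:
        raise ValueError("neither a Yes nor a No token is present")
    if not yes:
        return 0.0
    if not no:
        return 1.0
    log_yes, log_no = logsumexp(yes), logsumexp(no)
    # Mass on all other tokens is discarded by dividing by P(Yes) + P(No).
    return math.exp(log_yes - logsumexp([log_yes, log_no]))
```

With 60% on "Yes", 20% on "No", and 20% elsewhere, this returns 0.6 / (0.6 + 0.2) = 0.75.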
There is work that shows one way to approximate the normalizing constant via SMC [2], but I believe we are more likely to benefit from having a verifier at train-time than any other approach.
And there are stop-gap solutions to make log probabilities more reliable by only computing them on "relevant" tokens, e.g. only final numerical answer tokens for a math problem [3]. But this approach kind of side-steps the problem of actually trying to find relevant tokens. Perhaps something more in the spirit of System 2 attention which selects meaningful tokens for the generated output would be more promising [4].
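As a toy illustration of scoring only the "relevant" tokens, the sketch below sums logprobs over the tokens that overlap the last number in the generated text. It is not the method from [3]; the regex, the (token, logprob) input shape, and the choice of joint probability as the score are all assumptions.

```python
import math
import re


def final_number_confidence(tokens):
    """Joint probability of the tokens spelling the last number in the output.

    tokens: list of (token_text, logprob) pairs in generation order.
    Returns None when the text contains no number.
    """
    text = "".join(tok for tok, _ in tokens)
    matches = list(re.finditer(r"-?\d+(?:\.\d+)?", text))
    if not matches:
        return None
    start, end = matches[-1].span()

    total_logprob = 0.0
    pos = 0
    for tok, logprob in tokens:
        tok_start, tok_end = pos, pos + len(tok)
        pos = tok_end
        # Keep any token whose character span overlaps the number's span.
        if tok_start < end and tok_end > start:
            total_logprob += logprob
    return math.exp(total_logprob)
```

The hard-coded regex is exactly the side-step described above: it works for math answers only because we already know what a relevant token looks like.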
[1]: https://arxiv.org/abs/2406.08391 [2]: https://arxiv.org/abs/2404.17546 [3]: https://arxiv.org/abs/2402.10200 [4]: https://arxiv.org/abs/2311.11829
Der_Einzige|1 year ago
Indeed, ultra-high-temperature sampling should be studied in its own right. I can do top_k = 2 and temperature = system.maxint and get decent results which are extraordinarily creative (with increasing probability of token-related spelling issues as top_k goes up).
I'm convinced that the model's logprobs hold so much bloody value and knowledge that I unironically do not care about how many "theoretical guarantees" they lack or about their non-correspondence to our usage of language.
[1]: Btw, this paper has now been accepted at ICLR 2025 and is likely going to get an oral/honorable mention, since we are ranked #18 out of all submissions by scores and have an extremely favorable meta-review. Peer review seems to agree with our claims of extreme performance improvements.
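A small sketch of why that setting stays coherent: with top_k = 2, a huge temperature flattens the distribution, but only over the two highest-logit tokens, so sampling becomes close to a coin flip between them. The function names and the toy logits are made up for illustration; this is plain Python, not any particular library's sampler.

```python
import math
import random


def top_k_temperature_probs(logits, top_k, temperature):
    """Softmax over the top_k logits after dividing by temperature."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]
    scaled = [logits[i] / temperature for i in top]
    m = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scaled]
    z = sum(weights)
    return {i: w / z for i, w in zip(top, weights)}


def sample(logits, top_k=2, temperature=1e9, rng=random):
    """Draw one token id from the truncated, tempered distribution."""
    probs = top_k_temperature_probs(logits, top_k, temperature)
    ids = list(probs)
    return rng.choices(ids, weights=[probs[i] for i in ids], k=1)[0]


toy_logits = [5.0, 1.0, 4.0, -2.0]
print(top_k_temperature_probs(toy_logits, top_k=2, temperature=1.0))
print(top_k_temperature_probs(toy_logits, top_k=2, temperature=1e9))
```

Raising top_k at the same temperature spreads that near-uniform mass over more tokens, including poor continuations, which fits the spelling issues mentioned above.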
mrciffa|1 year ago
canjobear|1 year ago