top | item 42005301

s5ma6n | 1 year ago

I am puzzled why they "asked the model" about its confidence rather than using the logprobs of the output tokens to estimate confidence in the responses.

In my use cases and tests, the model itself is not capable of giving a reliable confidence value, whereas logprobs almost always provide a better view of calibration.
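One common way to turn per-token logprobs into a single confidence score is the geometric mean of the token probabilities, i.e. exp of the mean logprob, so longer answers aren't penalized just for having more tokens. A minimal sketch (the logprob values below are made up for illustration):

```python
import math

def sequence_confidence(token_logprobs):
    """Aggregate per-token logprobs into one confidence score.

    Uses exp(mean logprob), the geometric mean of the token
    probabilities, so the score is length-normalized.
    """
    if not token_logprobs:
        raise ValueError("no tokens to score")
    mean_logprob = sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_logprob)

# Hypothetical logprobs for a confident answer vs. a guess.
confident = sequence_confidence([-0.02, -0.05])      # ~0.97
uncertain = sequence_confidence([-1.2, -0.9, -1.5])  # ~0.30
```

Whether exp(mean logprob) or min-token probability correlates best with correctness is task-dependent; both are cheap to compute from the same API response.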

michaelt | 1 year ago

To measure confidence based on the logprobs of a given token, you must first know which token you're measuring - that's why a lot of benchmarks love multiple choice questions where the LLM responds with a single token.

But of course that's not the way LLMs are normally used. And it precludes any sort of chain-of-thought reasoning.
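For the single-token multiple-choice setup, scoring usually means reading the top-logprobs at the one answer position and renormalizing the probability mass over the allowed choice letters. A sketch under that assumption (the logprob values are hypothetical):

```python
import math

def choice_confidence(top_logprobs, choices=("A", "B", "C", "D")):
    """Renormalize the probability mass at the answer position
    over the allowed choice tokens only.

    top_logprobs: mapping of candidate token -> logprob, as
    returned for a single output position.
    """
    probs = {c: math.exp(top_logprobs[c]) for c in choices if c in top_logprobs}
    total = sum(probs.values())
    return {c: p / total for c, p in probs.items()}

# Hypothetical top-logprobs for the answer token.
dist = choice_confidence({"A": -0.1, "B": -2.5, "C": -3.0, "D": -4.0})
best = max(dist, key=dist.get)  # "A"
```

This is exactly why multiple choice is convenient: the whole answer lives in one position, so there is no ambiguity about which tokens to score.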

For some questions, like those involving calculations, letting the model talk to itself produces much better results. For example compare https://chatgpt.com/share/67238eda-6b08-8011-8d2d-a945f78e6f... to https://chatgpt.com/share/67235a98-d2c8-8011-b2bf-53c0efabea...

s5ma6n | 1 year ago

To me it boils down to what is being measured here. With logprobs we can measure both correctness and "not attempted", i.e. whether the LLM is guessing the response.

Similar to exams, where both the progress toward the solution and the final outcome/value of the calculations are part of the grade.

To have your cake and eat it too with chain-of-thought reasoning, one way is to ask for a "final answer" so that the logprobs of the final response tokens can be evaluated: https://chatgpt.com/share/67239d92-b24c-800a-af8c-40da7be1f5...

Another trick is using JSON mode to keep intermediate results and the final response separate, so each can be graded accordingly.
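The JSON-mode trick can be sketched as: locate the tokens that make up the "answer" field and score only those, ignoring the reasoning tokens. The field-locating logic below is a simplification that assumes the answer value appears verbatim in the output, and the token/logprob values are made up:

```python
import json
import math

def answer_confidence(tokens, logprobs, field="answer"):
    """Score only the tokens forming the value of `field` in a
    JSON response, skipping intermediate-reasoning tokens.

    tokens/logprobs are parallel lists, as returned alongside
    a completion.
    """
    text = "".join(tokens)
    value = str(json.loads(text)[field])
    start = text.rindex(value)          # last occurrence = the field value
    end = start + len(value)
    # Keep logprobs of tokens whose character span overlaps the value.
    picked, pos = [], 0
    for tok, lp in zip(tokens, logprobs):
        if pos < end and pos + len(tok) > start:
            picked.append(lp)
        pos += len(tok)
    return math.exp(sum(picked) / len(picked))

# Hypothetical tokenization of a JSON-mode response.
tokens = ['{"', 'steps', '":"', '12*', '7', '=', '84', '","',
          'answer', '":"', '84', '"}']
lps = [-0.01, -0.2, -0.01, -0.5, -0.4, -0.1, -0.3, -0.01,
       -0.05, -0.01, -0.02, -0.01]
conf = answer_confidence(tokens, lps)
```

A production version would use the character offsets the API returns per token instead of the substring search, but the separation idea is the same.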

_jonas | 1 year ago

Here are some benchmarks I ran comparing the precision/recall of various LLM error-detection methods, including logprobs and LLM self-evaluation / verbalized confidence:

https://cleanlab.ai/blog/4o-claude/

These approaches can detect errors better than random guessing, but there are other approaches that are significantly more effective in practice.

HappMacDonald | 1 year ago

I wonder what would happen if the token input included the logprob of each selected token (or N/A for input originating outside the LLM), and the network were trained with that extra layer of information, especially during the human-feedback training at the end.