s5ma6n | 1 year ago
In my use case and tests, the model itself is not capable of giving a reliable confidence value, whereas logprobs almost always provide a better view of calibration.
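A minimal sketch of what "using logprobs for confidence" can look like. The list below mimics the shape of the per-token logprob entries that chat-completion APIs return when logprobs are requested; the token strings and values are invented sample data, and the averaging heuristic is just one simple choice:

```python
import math

# Hypothetical per-token logprobs, shaped like the entries a
# chat-completion API returns when logprobs are requested.
# Tokens and values here are invented sample data.
token_logprobs = [
    {"token": "Paris", "logprob": -0.02},
    {"token": ".", "logprob": -0.15},
]

def answer_confidence(entries):
    """Average per-token probabilities as a rough confidence score in (0, 1]."""
    probs = [math.exp(e["logprob"]) for e in entries]
    return sum(probs) / len(probs)

conf = answer_confidence(token_logprobs)
print(f"confidence ~= {conf:.3f}")  # ~0.920 for this sample
```

Averaging probabilities is only one option; taking the minimum token probability, or the geometric mean, are common alternatives that penalize a single uncertain token more heavily.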
michaelt | 1 year ago
But of course that's not the way LLMs are normally used. And it precludes any sort of chain-of-thought reasoning.
For some questions, like those involving calculations, letting the model talk to itself produces much better results. For example, compare https://chatgpt.com/share/67238eda-6b08-8011-8d2d-a945f78e6f... to https://chatgpt.com/share/67235a98-d2c8-8011-b2bf-53c0efabea...
s5ma6n | 1 year ago
It's similar to exams, where both the work shown and the final value of the calculation count toward the grade.
To have your cake and eat it too with chain-of-thought reasoning, one option is to ask for a "final answer" section so that the logprobs of the final response tokens can be evaluated: https://chatgpt.com/share/67239d92-b24c-800a-af8c-40da7be1f5...
Another trick is using JSON mode to keep intermediate results and the final response separate, so each can be graded accordingly.
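A hedged sketch of the JSON-mode trick. The field names ("reasoning", "final_answer") are my own convention, not anything an API mandates, and the sample completion is invented; the point is just that a structured response lets you grade the intermediate work and the final value separately:

```python
import json

# Hypothetical JSON-mode completion; field names are my own convention.
raw = '{"reasoning": "8 workers * 6 widgets = 48 widgets/hour", "final_answer": "48"}'

def grade(completion: str, expected: str) -> dict:
    """Parse a JSON-mode completion and grade the final answer
    separately from the intermediate reasoning, exam-style."""
    parsed = json.loads(completion)
    return {
        "final_correct": parsed["final_answer"].strip() == expected,
        # Kept separate so the work shown can be graded on its own terms.
        "reasoning": parsed["reasoning"],
    }

result = grade(raw, "48")
print(result["final_correct"])
```

In practice the "reasoning" field would be scored by a rubric or a second model pass, while the "final_answer" tokens are the natural place to read off logprob-based confidence.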
_jonas | 1 year ago
https://cleanlab.ai/blog/4o-claude/
These approaches can detect errors better than random guessing, but there are other approaches that are significantly more effective in practice.
HappMacDonald | 1 year ago