andy12_ | 14 hours ago
Edit: There is also other work pointing out that chat models might not be calibrated at the token level, but might be calibrated at the concept level [2]. This means that if you sample many answers and group them by semantic similarity, the resulting frequencies are also calibrated. The problem is that generating and grouping many answers is more costly.
[1] https://arxiv.org/pdf/2303.08774, Figure 8
[2] https://arxiv.org/pdf/2511.04869, Figure 1
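The sample-and-group approach above is easy to sketch. The example below is a minimal, dependency-free illustration with made-up sampled answers; it uses `difflib.SequenceMatcher` string similarity as a cheap stand-in for the semantic similarity (embeddings or an NLI model) that the actual work uses:

```python
from difflib import SequenceMatcher

def group_answers(answers, threshold=0.6):
    """Greedily cluster sampled answers by similarity.

    Stand-in for semantic similarity: a real implementation would
    use embeddings or an entailment model to decide whether two
    answers mean the same thing.
    """
    clusters = []  # list of (representative answer, count)
    for ans in answers:
        for i, (rep, count) in enumerate(clusters):
            if SequenceMatcher(None, ans, rep).ratio() >= threshold:
                clusters[i] = (rep, count + 1)
                break
        else:
            clusters.append((ans, 1))
    return clusters

# Pretend these came from sampling the model N times at temperature 1.
samples = ["Paris", "Paris.", "It is Paris", "Lyon", "Paris"]
clusters = group_answers(samples)
total = sum(count for _, count in clusters)
for rep, count in clusters:
    # Cluster frequency is the (hopefully calibrated) concept probability.
    print(f"{rep!r}: p = {count / total:.2f}")
```

The cost problem is visible here: one calibrated estimate takes N forward passes plus O(N^2) pairwise comparisons, versus a single pass for token-level probabilities.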
geokon | 14 hours ago
You could color-code the output tokens so you can see abrupt changes in confidence.
It seems kind of obvious, so I'm guessing people have tried this.
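This is easy to prototype if the API hands back per-token probabilities (e.g. exponentiating returned logprobs). A minimal terminal sketch with made-up `(token, probability)` pairs, using ANSI colors:

```python
# Hypothetical (token, probability) pairs — in practice these would come
# from an API that returns per-token logprobs.
tokens = [("The", 0.98), ("capital", 0.95), ("of", 0.99),
          ("Australia", 0.90), ("is", 0.97), ("Sydney", 0.41)]

def colorize(token, p):
    """Wrap a token in an ANSI color by confidence: green > yellow > red."""
    if p >= 0.9:
        code = "32"   # green: high confidence
    elif p >= 0.6:
        code = "33"   # yellow: medium confidence
    else:
        code = "31"   # red: abrupt drop in confidence
    return f"\033[{code}m{token}\033[0m"

# Low-confidence tokens ("Sydney" here) jump out in red.
print(" ".join(colorize(t, p) for t, p in tokens))
```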
throwthrowuknow|10 hours ago