item 47134920

andyferris | 7 days ago

Actually - do they do this in LLM benchmarks? As a measure of overconfidence/confabulation? Seems immediately applicable.
impossiblefork | 6 days ago

I don't think it's a common thing in any public LLM benchmarks or in any standard QA datasets. Maybe in internal stuff at AI firms.
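For readers wondering what such a benchmark measure could look like, here is a minimal sketch of SAT-style negative marking, where a confidently wrong answer costs more than an abstention, so confabulation drags the score down. The function name, the penalty value, and the toy data are all illustrative assumptions, not taken from any real benchmark:

```python
# Illustrative sketch only: penalize wrong answers more than abstentions,
# so a model that confabulates scores worse than one that says "I don't know".

def negative_marking_score(answers, penalty=1.0):
    """Score a list of (predicted, gold) pairs.

    predicted may be None, meaning the model abstained.
    Correct answer: +1, abstention: 0, wrong answer: -penalty.
    Returns the mean score over all questions.
    """
    total = 0.0
    for predicted, gold in answers:
        if predicted is None:      # model declined to answer
            continue               # contributes 0 to the total
        total += 1.0 if predicted == gold else -penalty
    return total / len(answers)

# A model that guesses wrongly scores worse than one that abstains:
results_guesser = [("Paris", "Paris"), ("Rome", "Berlin")]
results_honest = [("Paris", "Paris"), (None, "Berlin")]
print(negative_marking_score(results_guesser))  # 0.0
print(negative_marking_score(results_honest))   # 0.5
```

Under plain accuracy both models would score 0.5; the penalty term is what separates honest uncertainty from overconfident guessing.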