item 47134920

andyferris | 7 days ago

Actually - do they do this in LLM benchmarks? As a measure of overconfidence/confabulation? Seems immediately applicable.


impossiblefork | 6 days ago

I don't think it's a common thing in any public LLM benchmarks or in any standard QA datasets. Maybe in internal stuff at AI firms.
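For what it's worth, one way such a measure could look is negative marking: reward correct answers, penalize confident wrong answers (confabulations), and give zero for abstaining. This is a hypothetical sketch, not any existing benchmark's scoring rule; the function name and penalty weight are made up for illustration.

```python
# Hypothetical sketch of penalized QA scoring: a confident wrong answer
# (confabulation) costs points, while abstaining (None) costs nothing.

def penalized_score(predictions, gold, penalty=1.0):
    """Mean score: +1 per correct answer, -penalty per wrong answer,
    0 when the model abstains (predicts None)."""
    total = 0.0
    for pred, truth in zip(predictions, gold):
        if pred is None:  # model declined to answer
            continue
        total += 1.0 if pred == truth else -penalty
    return total / len(gold)

gold = ["Paris", "4", "1969"]
confabulator = ["Paris", "5", "1970"]  # one right, two confident wrong
abstainer = ["Paris", None, None]      # one right, two abstentions

print(penalized_score(confabulator, gold))  # negative: penalized for guessing
print(penalized_score(abstainer, gold))     # positive: abstaining beats confabulating
```

Under plain accuracy both models score identically (1/3), which is exactly why a penalty term is needed to surface overconfidence.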