item 47134920

andyferris | 7 days ago

Actually - do they do this in LLM benchmarks? As a measure of overconfidence/confabulation? Seems immediately applicable.
impossiblefork | 6 days ago

I don't think it's a common thing in any public LLM benchmarks or in any standard QA datasets. Maybe in internal stuff at AI firms.
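For readers wondering what such a benchmark measure could look like, here is a minimal sketch of SAT-style negative marking, where a confidently wrong answer costs more than an abstention, so confabulation drags the score down. The function name, the penalty value, and the toy data are all illustrative assumptions, not taken from any real benchmark:

```python
# Illustrative sketch only: penalize wrong answers more than abstentions,
# so a model that confabulates scores worse than one that says "I don't know".

def negative_marking_score(answers, penalty=1.0):
    """Score a list of (predicted, gold) pairs.

    predicted may be None, meaning the model abstained.
    Correct answer: +1, abstention: 0, wrong answer: -penalty.
    Returns the mean score over all questions.
    """
    total = 0.0
    for predicted, gold in answers:
        if predicted is None:      # model declined to answer
            continue               # contributes 0 to the total
        total += 1.0 if predicted == gold else -penalty
    return total / len(answers)

# A model that guesses wrongly scores worse than one that abstains:
results_guesser = [("Paris", "Paris"), ("Rome", "Berlin")]
results_honest = [("Paris", "Paris"), (None, "Berlin")]
print(negative_marking_score(results_guesser))  # 0.0
print(negative_marking_score(results_honest))   # 0.5
```

Under plain accuracy both models would score 0.5; the penalty term is what separates honest uncertainty from overconfident guessing.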