Bullshit benchmark for LLMs | 1 point | gpvos | 3 days ago | twitter.com | 1 comment
noemit | 3 days ago
The underlying data looks scarce. If there are only a few questions per "category" of bullshit, the results can easily be gamed to favor one model over another.