top | item 47163620

Bullshit benchmark for LLMs

1 point | gpvos | 3 days ago | twitter.com

1 comment

noemit | 3 days ago

The underlying data looks sparse. If there are only a few questions per "category" of bullshit, the scores can easily be gamed to favor one model over another.
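To put rough numbers on that, here is a back-of-the-envelope sketch. The question counts and the 70% pass rate are made up for illustration, not taken from the benchmark, and it uses the normal approximation, which is itself shaky at small n. The helper name `pass_rate_std_error` is just something I picked.

```python
import math


def pass_rate_std_error(p: float, n: int) -> float:
    """Standard error of an observed pass rate p over n independent questions."""
    return math.sqrt(p * (1 - p) / n)


# Hypothetical per-category question counts, not the benchmark's real ones.
for n in (5, 20, 100):
    margin = 1.96 * pass_rate_std_error(0.7, n)  # approx. 95% margin of error
    print(f"n={n:>3}: observed 70% pass rate, margin of error about {margin:.0%}")
```

With only a handful of questions the margin is wide enough that swapping one or two questions could reorder the models, which is the worry above.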