top | item 45685211

growdark | 4 months ago

I'd love to see a benchmark that tests different LLMs for slop, not necessarily limited to code. That might be even more interesting than ARC-AGI.

Bolwin|4 months ago

Der_Einzige|4 months ago

Note this is the same first author

jampa|4 months ago

Not a benchmark per se, but there is a "Not x, but y" Slop Leaderboard:

topaz0|4 months ago

100% of LLM output is slop. Done.
