At this point it would be interesting to collect examples where LLMs fail miserably, in the form of a community database. I have examples myself...
Such examples are often "closely guarded secrets" precisely to prevent them from being benchmaxxed and gamed - which is exactly what would happen if you consolidated them in a publicly available, centralized repository.
Since such a database should evolve continuously, I wouldn't see that as a problem. The important thing is that each example is verifiable, in the form of an unmodifiable test setup: the LLM provides a solution, which is executed against the test to verify it. Something like the Acid3 test... But sure, probably any setup can be gamed somehow...
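The verification flow described here could be sketched roughly like this (a toy sketch, assuming each task's solution is a Python function named `solve` - a hypothetical convention, not from the thread; a real harness would sandbox the execution):

```python
# Toy harness: treat the LLM's answer as opaque source code and run it
# against a fixed, unmodifiable set of test cases - pass/fail only.

def verify(candidate_source: str, test_cases) -> bool:
    """Execute LLM-provided source, then check it against frozen tests."""
    namespace = {}
    exec(candidate_source, namespace)  # in practice: sandboxed, time-limited
    solve = namespace["solve"]         # assumed entry-point name
    return all(solve(inp) == expected for inp, expected in test_cases)

# Example task ("reverse a string") with its frozen test cases:
llm_answer = "def solve(s):\n    return s[::-1]"
tests = [("abc", "cba"), ("", "")]
print(verify(llm_answer, tests))  # True
```

The point of keeping the tests unmodifiable is that only the candidate source varies; the database would hold the frozen test side.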
This seems like a non-issue, unless I'm misunderstanding. If failures can be used to game benchmarks, companies are already doing so. They don't need us to avoid compiling such information, which would be helpful to actual users.
vunderba|5 months ago
la_fayette|5 months ago
squigz|5 months ago