At this point it would be interesting to collect examples where LLMs fail miserably, in the form of a community database. I have examples myself...
Such examples are often "closely guarded secrets" precisely to prevent them from being benchmaxxed and gamed - which is exactly what would happen if you consolidated them in a publicly available, centralized repository.
Since such a database should evolve continuously, I wouldn't see that as a problem. The important thing is that each example is verifiable, in the form of an unmodifiable test setup: the LLM provides a solution, which is executed against the test to verify it. Something like the Acid3 test... But sure, probably any setup can be gamed somehow...
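The verification flow described here could be sketched roughly like this (a toy sketch, assuming each task's solution is a Python function named `solve` - a hypothetical convention, not from the thread; a real harness would sandbox the execution):

```python
# Toy harness: treat the LLM's answer as opaque source code and run it
# against a fixed, unmodifiable set of test cases - pass/fail only.

def verify(candidate_source: str, test_cases) -> bool:
    """Execute LLM-provided source, then check it against frozen tests."""
    namespace = {}
    exec(candidate_source, namespace)  # in practice: sandboxed, time-limited
    solve = namespace["solve"]         # assumed entry-point name
    return all(solve(inp) == expected for inp, expected in test_cases)

# Example task ("reverse a string") with its frozen test cases:
llm_answer = "def solve(s):\n    return s[::-1]"
tests = [("abc", "cba"), ("", "")]
print(verify(llm_answer, tests))  # True
```

The point of keeping the tests unmodifiable is that only the candidate source varies; the database would hold the frozen test side.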
This seems like a non-issue, unless I'm misunderstanding. If failures can be used to game benchmarks, companies are already doing so. They don't need us to avoid compiling such information, which would be helpful to actual users.
vunderba|5 months ago
la_fayette|5 months ago
squigz|5 months ago