jo909 | 1 year ago
They could. They would quickly be found out, though, as they lose in real-world usage or on new, improved benchmarks.
If you were in charge of a large and well-funded model, would you rather pay people to find LLM benchmarks and "cheat" by training on them, or would you pay people to identify benchmarks and make reasonably sure they are specifically excluded from the training data?
I would exclude them as thoroughly as possible, so that I get feedback on how "real" any model improvement is. I need to develop real-world improvements in the end, and any short-term gain in usage from cheating on benchmarks seems very foolish.
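[A minimal sketch of what such exclusion, often called decontamination, can look like in practice: drop any training document that shares a long word n-gram with the benchmark's test set. The function names, the n-gram size, and the exact-match criterion here are illustrative assumptions, not any particular lab's pipeline.]

    # Sketch: n-gram overlap filtering of training docs against a benchmark.
    # All names and parameters are illustrative, not a real lab's pipeline.
    from typing import Iterable, Set, Tuple

    N = 13  # n-gram length; long enough that overlaps are rarely coincidental

    def ngrams(text: str, n: int = N) -> Set[Tuple[str, ...]]:
        """Lowercased word n-grams of a document."""
        words = text.lower().split()
        return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

    def build_benchmark_index(items: Iterable[str]) -> Set[Tuple[str, ...]]:
        """Union of n-grams over every benchmark prompt and reference answer."""
        index: Set[Tuple[str, ...]] = set()
        for item in items:
            index |= ngrams(item)
        return index

    def is_contaminated(doc: str, index: Set[Tuple[str, ...]]) -> bool:
        """True if the training document shares any n-gram with the benchmark."""
        return not ngrams(doc).isdisjoint(index)

    # Usage: keep only training docs with no overlap against the benchmark.
    # clean_corpus = (d for d in corpus if not is_contaminated(d, index))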
gloosx | 1 year ago
When you're in charge of a company with a billion-dollar valuation that is expected to remain unprofitable through 2029, it's hard to find a topic more crucial and intriguing than growth and making more money.
And yes, it is a recurring theme for vendors to tune their products specifically for industry-standard benchmarks. I can't see any specific reason for them not to pay people to train their model to score 90% on these 113 Python tasks: it directly drives profits up, whereas not doing it brings absolutely nothing to the table. Surely they have their own internal benchmarks, which they can exclude from training data.
youoy | 1 year ago
You should already know by now that economic incentives are not always aligned with science/knowledge...
This is the true alignment problem, not the AI alignment one hahaha
concordDance | 1 year ago
One is just a bit harder due to the less familiar mind "design".
carschno | 1 year ago
Also, you are right that excluding test data from the training data improves your evaluation. However, given the insane amounts of training data, this requires significant effort. If it additionally leads to your model performing worse on existing leaderboards, I doubt that (commercial) organizations would pay for such an effort.
And again, as long as there is no better evaluation method, you still won't know how much it really helps.