F6F6FA | 4 years ago
For a computer science analogy: it is like a paper finding that most successful programming languages are created at prestigious institutions. An obvious - but not a bad - finding. It's not as if you could give the motivation, skills, expertise, resources, and time to a small new institute and expect it to come up with a new language that the community will adopt.
Yes, if you create and publish a good dataset and it gets adopted by the community, you gain lots of citations. This reward is well known, which is why some researchers expend the effort of gathering and curating all that data.
It is not a "vehicle for inequality in science". Benchmarks in ML are a way to create a level playing field for everyone and allow results to be compared. Picking a non-standard new benchmark to evaluate your algorithm is bad practice. Benchmarks are the true meritocracy: beat the benchmark and you too can publish, regardless of the PR or extra resources of the big labs. It is the test evaluation that counts, and that makes it fair. Other fields may have authorities writing papers without any evaluation at all. That's not a good position for a field to be in.
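To make concrete what I mean by "it is the test evaluation that counts", here is a minimal sketch of benchmark-style evaluation, assuming scikit-learn and a toy dataset; the model is just a placeholder for whatever method a participant submits.

```python
# Minimal sketch of benchmark-style evaluation: every submission is scored
# on the same fixed, held-out test set. The model choice is a placeholder.
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X, y = load_digits(return_X_y=True)

# The split is fixed (seeded) so all participants face the same test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

model = LogisticRegression(max_iter=1000)  # any competing method goes here
model.fit(X_train, y_train)

# Only this number decides the leaderboard, not who submitted it.
print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))
```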
> The prima facie scientific validity granted by SOTA benchmarking is generically confounded with the social credibility researchers obtain by showing they can compete on a widely recognized dataset
Here the authors pretend the social credibility of researchers has any sway. A Master's student in Bangladesh has no social credibility, but when they show they can compete on the benchmark, they can join the field and publish. Wonderful!
Where the authors point to the long history of train-test splits to argue that the cons have outweighed the benefits, they should reason more carefully and provide more data to actually show this and bring the field along. Ironically, people take more note of this very paper because of the authors' institutional affiliation. I do too. If they had a benchmark, I would have looked at that first.
> Given the observed high concentration of research on a small number of benchmark datasets, we believe diversifying forms of evaluation is especially important to avoid overfitting to existing datasets and misrepresenting progress in the field.
I believe these authors find diversity important. But on overfitting, they should look at actual (meta-)studies and data, which seem to conflict with their claim. For instance:
> A Meta-Analysis of Overfitting in Machine Learning (2019)
> We conduct the first large meta-analysis of overfitting due to test set reuse in the machine learning community. Our analysis is based on over one hundred machine learning competitions hosted on the Kaggle platform over the course of several years. In each competition, numerous practitioners repeatedly evaluated their progress against a holdout set that forms the basis of a public ranking available throughout the competition. Performance on a separate test set used only once determined the final ranking. By systematically comparing the public ranking with the final ranking, we assess how much participants adapted to the holdout set over the course of a competition. Our study shows, somewhat surprisingly, little evidence of substantial overfitting. These findings speak to the robustness of the holdout method across different data domains, loss functions, model classes, and human analysts.
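For illustration, here is a minimal sketch of the kind of public-vs-final ranking comparison that study describes, on synthetic scores; the noise model and numbers are made up, and the paper's actual methodology is far more thorough.

```python
# Sketch of the public-vs-final ranking comparison described above, using
# synthetic leaderboard scores. High rank agreement between the reused
# public holdout and the once-used final test set is evidence against
# substantial overfitting to the holdout.
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(42)
n_teams = 200

# Each team's "true" skill, plus independent noise on the two test sets.
skill = rng.normal(size=n_teams)
public_score = skill + rng.normal(scale=0.1, size=n_teams)   # reused holdout
private_score = skill + rng.normal(scale=0.1, size=n_teams)  # used once

# If teams had adapted to the public holdout, their ordering would not
# carry over to the final ranking and this correlation would drop.
rho, _ = spearmanr(public_score, private_score)
print(f"Spearman rank correlation (public vs. final): {rho:.3f}")
```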