shawntan | 4 months ago

You can have benchmarks with specifically constructed train-test splits for task-specific models. Train only on the train split; the results on the test split should be what is reported.

You can still game those benchmarks (e.g. by tuning your hyperparameters after looking at test results), but that setting measures generalisation to the test set _given_ the specified training set. Using any additional data goes against the benchmark rules, and such results should not be compared on the same terms.
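The protocol being described, where tuning never touches the test split, is usually implemented with a third, validation split. Here is a minimal pure-Python sketch of that idea on toy data; the data generator, the threshold "model", and all names are hypothetical, chosen only to keep the example self-contained.

```python
import random

random.seed(0)

# Toy task: x in [0, 1], true label is 1 when x > 0.5, with 10% label noise.
def make_split(n):
    data = []
    for _ in range(n):
        x = random.random()
        y = int(x > 0.5)
        if random.random() < 0.1:
            y = 1 - y
        data.append((x, y))
    return data

train, val, test = make_split(200), make_split(100), make_split(100)

def accuracy(threshold, data):
    # Our "model" is a single threshold on x.
    return sum(int(x > threshold) == y for x, y in data) / len(data)

# Hyperparameter search: the threshold is chosen on the VALIDATION set only.
candidates = [i / 20 for i in range(1, 20)]
best = max(candidates, key=lambda t: accuracy(t, val))

# The test set is touched exactly once, after all tuning is finished.
print(f"chosen threshold={best:.2f}, test accuracy={accuracy(best, test):.2f}")
```

The point of the third split: iterating on `val` can overfit the hyperparameter to `val`, but the single final evaluation on `test` remains an unbiased estimate. Feeding test results back into the loop is exactly the gaming described above.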

YeGoblynQueenne | 4 months ago

What I'm pointing out above is that everyone games the benchmarks in the way that you say, by tuning their models until they do well on the test set. They train, they test, and they iterate until they get it right. At that point any results are meaningless for the purpose of estimating generalisation because models are effectively overfit to the test set, without ever having to train on the test set directly.

And this is standard practice: everyone does it, all the time, and I believe a sizeable majority of researchers don't even realise that what they're doing is pointless, because it's what they've been taught to do, by looking at each other's work, by what their supervisors tell them, and so on.

Btw, we don't really care about generalisation on the test set, per se. The point of testing on a held-out test set is that it's supposed to give you an estimate of a model's generalisation on truly unseen data, i.e. data that was not available to the researchers during training. That's the generalisation we're really interested in. And the reason we're interested in that is that if we deploy a model in a real-world situation (rare as that may be) it will have to deal with unseen data, not with the training data, nor with the test data.