iflp|3 years ago
If you only care about identically distributed test data, test-set overfitting doesn't happen that fast: if you evaluate M models on N test samples, the overfitting error is on the order of sqrt(log M / N) (Hoeffding's inequality plus a union bound over the M models). And even as this error becomes noticeable, the relative ranks among the models are more stable still, since you can apply small-variance bounds to the accuracy differences between models. This has actually been verified empirically on models proposed for CIFAR-10.
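A minimal simulation sketch of the sqrt(log M / N) scaling, assuming (hypothetically) that all M models share the same true accuracy and test errors are i.i.d. Bernoulli:

```python
import numpy as np

rng = np.random.default_rng(0)

M, N = 1000, 10_000
true_acc = 0.9  # hypothetical shared true accuracy for all M models

# Each model's measured test accuracy on the same N i.i.d. samples:
# the mean of N Bernoulli(true_acc) outcomes.
test_acc = rng.binomial(N, true_acc, size=M) / N

# Worst-case gap between test accuracy and true accuracy over all M models.
worst_gap = np.abs(test_acc - true_acc).max()

# Hoeffding + union bound scale: sqrt(log(2M) / (2N)).
bound = np.sqrt(np.log(2 * M) / (2 * N))

print(f"worst observed gap:     {worst_gap:.4f}")
print(f"sqrt(log M / N) scale:  {bound:.4f}")
```

Even with a thousand "models" adaptively picking the luckiest test score, the worst gap stays within the sqrt(log M / N) envelope, which is the sense in which the leaderboard overfits slowly.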
rsfern|3 years ago
I hadn’t seen that result before, definitely interested in related work
iflp|3 years ago
The CIFAR experiments I mentioned were https://arxiv.org/pdf/1806.00451.pdf. The paper doesn't make this argument explicitly (unfortunate wording on my part), but its results appear to support it well.