nil-sec | 5 years ago

1. It isn't an issue. They make inferences on a sample-by-sample basis. The network has no memory, so it won't expect a 50/50 distribution on the test set just because it's trained like that. Having a balanced distribution is exactly the right thing to do, because you do not want the network to be biased toward one class or the other for any given sample. If it were unbalanced, the network could achieve almost zero training error by just predicting negative all the time, which is not what you want.
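To make that failure mode concrete, here is a minimal sketch (plain Python, with a made-up 99:1 split) of how an "always negative" predictor scores near-perfect accuracy on an unbalanced set:

    # Hypothetical test set: 990 negative, 10 positive samples (99:1 imbalance).
    labels = [0] * 990 + [1] * 10

    # A "classifier" that ignores its input and always predicts negative.
    predictions = [0 for _ in labels]

    correct = sum(p == y for p, y in zip(predictions, labels))
    print(f"accuracy: {correct / len(labels):.1%}")  # 99.0%, despite learning nothing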

djsbshek | 5 years ago

My main concerns with the imbalance are undersampling the negative-class data distribution relative to the positive class, and overestimating performance on the test splits. I can buy that you may want to train on a balanced dataset, but the testing conditions should reflect the true case distribution as closely as possible.
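To sketch why the test prevalence matters (the numbers here are made up for illustration): hold sensitivity and specificity fixed, and precision still collapses as the true prevalence drops, which is exactly the kind of overestimate a 50/50 test split hides:

    # Hypothetical classifier: 95% sensitivity, 95% specificity.
    sens, spec = 0.95, 0.95

    for prevalence in (0.5, 0.1, 0.01):
        tp = sens * prevalence              # expected true positives per sample
        fp = (1 - spec) * (1 - prevalence)  # expected false positives per sample
        precision = tp / (tp + fp)
        print(f"prevalence {prevalence:.0%} -> precision {precision:.1%}")
    # 50% -> 95.0%, 10% -> 67.9%, 1% -> 16.1%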

I agree that you would not want to use only the class priors for prediction. However, I do not think it is clear that you would want to throw that information out. I am also not sure I agree with the statement that a neural network has "no memory" of the prior class distribution. That is a strong claim to make about something as opaque as a neural net model.
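One standard way to keep that information rather than throw it out is to rescale the model's outputs when the deployment prior differs from the balanced training prior. A sketch, assuming the network outputs a calibrated positive-class probability and an assumed 1% true prevalence:

    # Reweight a probability from a model trained on a 50/50 split by the
    # deployment prior, then renormalize (simple Bayes prior correction).
    def adjust_for_prior(p_pos, train_prior=0.5, true_prior=0.01):
        pos = p_pos * true_prior / train_prior
        neg = (1 - p_pos) * (1 - true_prior) / (1 - train_prior)
        return pos / (pos + neg)

    print(f"{adjust_for_prior(0.9):.3f}")  # ~0.083: confident under 50/50, unlikely at 1% prevalence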

nil-sec | 5 years ago

They could have used all negative samples for testing (and even training, had they done it better), yes. But once your test set is large enough, whatever that means, it's not that relevant anymore. They are "undersampling" anyway by not recording data from every human who is negative right now.
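"Large enough" can be made precise with a standard error on the measured accuracy; a rough sketch using the normal approximation to the binomial:

    from math import sqrt

    # 95% confidence half-width for an accuracy estimate p measured on n test samples.
    def ci_half_width(p, n):
        return 1.96 * sqrt(p * (1 - p) / n)

    for n in (100, 1000, 10000):
        print(f"n={n}: accuracy 90% +/- {ci_half_width(0.9, n):.1%}")
    # n=100: +/- 5.9%, n=1000: +/- 1.9%, n=10000: +/- 0.6%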

And no, it's not a strong claim to make. Of course the network learns the distribution of your training set; that's why you want it balanced. But during successive applications of inference the weights do not change, so it has no state. It cannot, for example, store that it just predicted 90% negative and decide it is now time for a positive prediction again.
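The statelessness is easy to check directly: with the weights frozen, inference is a pure function of its input, so earlier predictions cannot influence later ones. A minimal sketch with a stand-in PyTorch model:

    import torch

    model = torch.nn.Sequential(torch.nn.Linear(8, 2))  # stand-in for any feedforward net
    model.eval()  # disable training-only behavior (dropout, batch-norm updates)

    x = torch.randn(1, 8)
    with torch.no_grad():
        first = model(x)
        for _ in range(100):             # run 100 other predictions in between...
            model(torch.randn(1, 8))
        again = model(x)

    print(torch.equal(first, again))  # True: the in-between predictions left no state behind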