tbalsam | 10 months ago
Also, for classification, MaxPooling is often far superior: you can learn an average smoothing filter in your convolutions beforehand in a data-dependent manner so that the Nyquist sampling stuff is properly preserved.
Also, please do label-smoothed cross-entropy for image classification stuff (generally speaking, unless maybe the data is hilariously large); MSE won't nearly cut it!
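For reference, a minimal PyTorch sketch of what label-smoothed cross-entropy looks like in practice (assuming PyTorch >= 1.10, where CrossEntropyLoss gained a label_smoothing argument; the shapes here are placeholders):

    import torch
    import torch.nn as nn

    logits = torch.randn(8, 1000)           # (batch, num_classes), placeholder shapes
    targets = torch.randint(0, 1000, (8,))  # integer class labels

    # Label smoothing moves a little probability mass (here 0.1) off the
    # true class and spreads it uniformly over the other classes.
    criterion = nn.CrossEntropyLoss(label_smoothing=0.1)
    loss = criterion(logits, targets)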
But that being said, adaptive pooling certainly is great when doing classification. Something to note is that batching does become an issue at a certain point -- as do certain other fine-grained details if you're simply going to average it all down to one single vector (IIUC).
threeducks | 10 months ago
Of course. The MSE here is not intended as a training loss, but as a means to demonstrate that both approaches lead to almost the same result, up to rounding error. The MSE is on the order of 10^-9.
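(Not the exact code from the post, but the kind of check meant here might look like this: adaptive average pooling to 1x1 and a plain spatial mean should agree up to floating-point rounding.)

    import torch
    import torch.nn as nn

    x = torch.randn(1, 768, 7, 7)  # the feature-map shape discussed below

    pooled = nn.AdaptiveAvgPool2d(1)(x).flatten(1)  # (1, 768)
    mean = x.mean(dim=(2, 3))                       # (1, 768)

    mse = ((pooled - mean) ** 2).mean()
    print(mse.item())  # tiny, i.e. pure rounding error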
> Also, for classification, MaxPooling is often far superior: you can learn an average smoothing filter in your convolutions beforehand in a data-dependent manner so that the Nyquist sampling stuff is properly preserved.
I don't think that max pooling the last feature maps would be a good idea here, because it would cut off about 98% of the gradients and training would take much longer. (The shape of the input feature map is (1, 768, 7, 7), pooled to (1, 768, 1, 1).)
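(To make the 98% figure concrete, here is a small sketch: max pooling a 7x7 map down to 1x1 routes the gradient to a single spatial position per channel, i.e. 1 of 49 locations.)

    import torch
    import torch.nn.functional as F

    x = torch.randn(1, 768, 7, 7, requires_grad=True)

    # Max-pool each 7x7 map down to a single value and backprop.
    F.adaptive_max_pool2d(x, 1).sum().backward()

    # Fraction of positions that received any gradient at all:
    print((x.grad != 0).float().mean().item())  # ~0.0204 = 1/49, so ~98% zeroed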
> Something to note is that batching does become an issue at a certain point
Could you elaborate on that?
tbalsam | 10 months ago
Ah, gotcha
> I don't think that max pooling the last feature maps would be a good idea here, because it would cut off about 98% of the gradients and training would take much longer. (The shape of the input feature map is (1, 768, 7, 7), pooled to (1, 768, 1, 1).)
MaxPooling is generally only useful if you're training your network for it, but in most cases it ends up performing better. That sparsity actually ends up being a good thing -- you generally need to suppress all of those unused activations! It ends up being quite a wide gap in practice (and, if you have convolutions beforehand, using average pooling is a bit of wasted extra computation blurring the input).
> Could you elaborate on that?
Variable-sized inputs don't batch easily, since the input dims need to match. You can go down the padding route, but that has its own particularly hellacious costs that end up taking away from compute you could be using for other useful things.
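One common workaround looks something like this (a minimal sketch; pad_to_batch is a made-up helper, and the padded pixels are exactly the wasted compute in question -- real pipelines usually also carry a mask so the padding doesn't pollute the pooling):

    import torch
    import torch.nn.functional as F

    def pad_to_batch(images):  # list of (C, H, W) tensors of varying H, W
        max_h = max(img.shape[1] for img in images)
        max_w = max(img.shape[2] for img in images)
        # F.pad takes (left, right, top, bottom) for the last two dims.
        padded = [
            F.pad(img, (0, max_w - img.shape[2], 0, max_h - img.shape[1]))
            for img in images
        ]
        return torch.stack(padded)  # (N, C, max_h, max_w)

    batch = pad_to_batch([torch.randn(3, 224, 224), torch.randn(3, 180, 300)])
    print(batch.shape)  # torch.Size([2, 3, 224, 300])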