gamegoblin|11 years ago
If your layer size is relatively small (not hundreds or thousands of nodes), dropout is usually detrimental, and a more traditional regularization method such as weight decay is superior.
For the size of networks Hinton et al. are playing with nowadays (thousands of nodes per layer), dropout works well, though.
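For concreteness, here is a minimal numpy sketch of the two regularizers being contrasted: inverted dropout at train time versus an L2 weight-decay term folded into the gradient step. All sizes, rates, and names here are illustrative, not taken from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_forward(h, p_drop, train=True):
    """Inverted dropout: zero each unit with prob p_drop and rescale the
    survivors by 1/(1 - p_drop), so the expected activation is unchanged
    and no rescaling is needed at test time."""
    if not train or p_drop == 0.0:
        return h
    mask = (rng.random(h.shape) >= p_drop) / (1.0 - p_drop)
    return h * mask

def sgd_step_weight_decay(W, grad, lr=0.01, decay=1e-4):
    """Plain SGD with L2 weight decay: the decay term shrinks every
    weight toward zero a little on each step."""
    return W - lr * (grad + decay * W)
```

At test time `dropout_forward(h, p, train=False)` is the identity, which is the practical reason the inverted form is preferred.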
agibsonccc|11 years ago
I've found a combination of the two to be great. Most deep networks (even the plain feed-forward variety) tend to generalize better with random dropout applied per mini-batch over multiple epochs. This has held for both the image and word-vector representations I've worked with.
vundervul|11 years ago
Who is Arno Candel, and why should we pay attention to his tips on training neural networks? Anyone who suggests grid search for hyperparameter tuning is out of touch with the consensus among experts in deep learning. A lot of people are coming out of the woodwork and presenting themselves as experts in this exciting area because it has had so much recent success, but most of them seem to be beginners. Having lots of beginners learning is fine and healthy, but many of these people act as if they are experts.
His LinkedIn profile looks pretty legit to me.
http://www.linkedin.com/in/candel
I wouldn't want to get into an ML dick-measuring contest with him anyway. H2O looks awesome too.
fredmonroe|11 years ago
I think you are misinterpreting what he is saying about grid search. The grid search is just to narrow the field of parameters initially; he doesn't say how he would proceed after that point.
Just curious, what do you consider the state of the art? Bayesian optimization? Wouldn't a grid search to start be like a uniform prior?
The rest of his suggestions looked on point to me. Did you see anything else you would differ with? (I ask sincerely, for my own education.)
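As a sketch of that two-stage idea (a coarse grid to narrow the field, then refinement around the winner), here is a toy example. `val_loss` is a hypothetical stand-in for "train the network and return validation loss", and all the grid values are illustrative:

```python
import itertools
import math

def val_loss(lr, l2):
    """Hypothetical surrogate for 'train a net, return validation loss'.
    In this toy it is minimized at lr=1e-2, l2=1e-4."""
    return (math.log10(lr) + 2) ** 2 + (math.log10(l2) + 4) ** 2

def coarse_grid_search(lrs, l2s):
    """Evaluate every (lr, l2) pair; return the best config and its loss."""
    best = min(itertools.product(lrs, l2s), key=lambda p: val_loss(*p))
    return best, val_loss(*best)

# Stage 1: coarse, log-spaced grid -- effectively a uniform prior over
# log-space, as the comment suggests.
lrs = [1e-4, 1e-3, 1e-2, 1e-1]
l2s = [1e-6, 1e-5, 1e-4, 1e-3]
(best_lr, best_l2), loss = coarse_grid_search(lrs, l2s)

# Stage 2: refine around the coarse winner (hand tuning, random search,
# or Bayesian optimization would slot in here).
refined_lrs = [best_lr / 3, best_lr, best_lr * 3]
```

Searching on a log scale matters here: learning rates and regularization strengths vary over orders of magnitude, so a linear grid would waste most of its points.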
A question about the actual slides: why don't they use unsupervised pretraining (i.e. a sparse autoencoder) for MNIST? Is it just to show that they don't need pretraining to achieve good results, or is there something deeper?
I've only been watching from the deep learning sidelines, but I believe people have steered away from pretraining over the past year or two. On practical datasets it doesn't seem to help.
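For readers unfamiliar with the technique being discussed, here is a minimal numpy sketch of sparse-autoencoder pretraining: one sigmoid hidden layer with tied weights and an L1 penalty on the hidden code, trained by gradient descent. All sizes, penalties, and data are illustrative, not from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy sparse autoencoder: sigmoid encoder, tied-weight linear decoder,
# L1 sparsity penalty on the hidden code (illustrative sizes/penalty).
n_in, n_hid, lam = 20, 8, 1e-3
W = rng.normal(0.0, 0.1, (n_in, n_hid))
b_h = np.zeros(n_hid)
b_o = np.zeros(n_in)

def forward(X):
    h = sigmoid(X @ W + b_h)    # encoder
    x_hat = h @ W.T + b_o       # tied-weight linear decoder
    return h, x_hat

def loss(X):
    h, x_hat = forward(X)
    return np.mean((x_hat - X) ** 2) + lam * h.mean()  # h > 0, so |h| = h

X = rng.normal(0.0, 1.0, (64, n_in))
loss_before = loss(X)

lr = 0.2
for _ in range(300):
    h, x_hat = forward(X)
    G = 2.0 * (x_hat - X) / X.size          # d(mse)/d(x_hat)
    dh = G @ W + lam / h.size               # decoder path + sparsity term
    dz = dh * h * (1.0 - h)                 # back through the sigmoid
    W -= lr * (G.T @ h + X.T @ dz)          # tied weights: both paths
    b_o -= lr * G.sum(axis=0)
    b_h -= lr * dz.sum(axis=0)

loss_after = loss(X)
# In the pretraining recipe, W and b_h would now initialize the first
# layer of the supervised classifier before fine-tuning.
```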
agibsonccc|11 years ago
https://news.ycombinator.com/item?id=7803101
I will also add that looking into Hessian-free optimization for training feed-forward nets, over conjugate gradient/LBFGS/SGD, has proven to be amazing [1].
Recursive nets I'm still playing with, but based on the work by Socher [2], LBFGS worked just fine for those.
[1]: http://www.cs.toronto.edu/~rkiros/papers/shf13.pdf
[2]: http://socher.org/
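Hessian-free optimization isn't available in the common scientific-Python libraries, but the LBFGS side of the comparison can be sketched with scipy's L-BFGS-B on a toy full-batch problem. The data and model here are synthetic and illustrative only:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)

# Fit a tiny linear model with L-BFGS-B: a full-batch quasi-Newton
# optimizer of the kind the comment contrasts with SGD. Unlike SGD,
# it uses the exact gradient of the whole objective on every step.
X = rng.normal(size=(100, 5))
true_w = np.arange(1.0, 6.0)
y = X @ true_w                     # noiseless targets for the toy fit

def objective(w):
    r = X @ w - y
    return 0.5 * r @ r, X.T @ r    # loss and its exact gradient

# jac=True tells scipy the function returns (loss, gradient) together.
res = minimize(objective, np.zeros(5), jac=True, method="L-BFGS-B")
```

On a small noiseless problem like this, LBFGS recovers `true_w` in a handful of iterations; the trade-offs versus SGD only show up at neural-net scale, where full-batch gradients get expensive.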