You can probably use deep learning even if you don't have a lot of data

[+] rkaplan|8 years ago|reply

This post doesn't even mention the easiest way to use deep learning without a lot of data: download a pretrained model and fine-tune the last few layers on your small dataset. In many domains (like image classification, the task in this blog post) fine-tuning works extremely well, because the pretrained model has learned generic features in the early layers that are useful for many datasets, not just the one trained on.

Even the best skin cancer classifier [1] was pretrained on ImageNet.

[1]: http://www.nature.com/articles/nature21056

[+] a_bonobo|8 years ago|reply

This is how the great fast.ai course begins - download VGG16, finetune the top layer with a single dense layer, get amazing results. The second or third class shows how to make the top layers a bit more complex to get even better accuracy.

[+] alexcnwy|8 years ago|reply

Totally agree.

Similarly for word embeddings like word2vec, GLoVE, fasttext etc in the case of NLP.

I think this is fundamental - if you teach a human how to recognize street signs, you don't need to show them millions of examples - just one or a few of each is enough because we build on reference experiences of past objects seen through life experience to encode the new images as memories.

[+] ec109685|8 years ago|reply

It is mentioned at the end of the post:

> You don’t need Google-scale data to use deep learning. Using all of the above means that even your average person with only a 100-1000 samples can see some benefit from deep learning. With all of these techniques you can mitigate the variance issue, while still benefitting from the flexibility. You can even build on others work through things like transfer learning.

[+] bayonetz|8 years ago|reply

To be fair to the original article, his assertion was more along the lines of you can't train a deep net without lots of data. As the second article shows, that isn't true in the general case. However, it is certainly true for creating any of the interesting models you think of when you think of deep nets (i.e., Inception, word2vec, etc.). You just can't the richness of these without a lot of data to train them.

[+] autokad|8 years ago|reply

Deep learning can do pretty well when its not pre-trained as well.

I have this data set that is word counts for top 5k words, 5000 observations training, 5000 hold out. I consider this data pretty small.

SVM with rbf kernal can get around 87-88% accuracy, but a histogram kernal can get around 89.7% accuracy with a little feature engineering.

Tensorflow, after tuning some parameters, can also get around 89.7% accuracy as well.

[+] synaesthesisx|8 years ago|reply

Transfer learning is efficient (minimal training time) and useful for most classification tasks across various domains. Some of the first models I've used were built on Inception/ImageNet and I recall being thoroughly impressed by the performance.

[+] ganwar|8 years ago|reply

This only works though if the pretraining and your training, both are with the data in the same domain. Even in that you will have issues if the data was from same domain but different in representation, e.g. 2D and 3D image datasets.

[+] unknown|8 years ago|reply

[deleted]

[+] wakkaflokka|8 years ago|reply

What would you consider the best resource for learning how to do this in Python? I have a smaller set of image data where I'd like to identify the components of (i.e. 'house', 'car', etc.).

[+] option_greek|8 years ago|reply

Does transfer learning apply to seq2seq as well ?

[+] alexcnwy|8 years ago|reply

[deleted]

[+] ganwar|8 years ago|reply

[deleted]

[+] shadowmint|8 years ago|reply

You probably can... but is that really the issue?

I think the problem isnt that you cant solve problems with small amounts of data; its that you can't solve 'the problem' at a small scale and then just apply that solution at large scale... and that's not what people want or expect.

People expect that if you have an industrial welder than can assemble areoplanes (apparently), then you should easily be able to check it out by welding a few sheets of metal together, and if it welds well on a small scale, it should be representative of how well it welds entire vehicles.

...but thats not how DNN models work. Each solution is a specific selection of hyperparameters for the specific data and specific shape of that data. As we see here, specific even to the volume of data available.

It doesnt scale up and it doesn't scale down.

To solve a problem you just have to sort of.... just mess around with different solutions until you get a good one. ...and even then, you've got no really strong proof your solution is good; just that its better than the other solutions you've tried.

Thats the problem; its really hard to know when DNN are the wrong choice, vs. you're just 'doing it wrong'

[+] erickscott|8 years ago|reply

Andrew Beam's post offered very persuasive evidence that Jeff Leek's intuition (deep learning yields poor performance with small sample sizes) is incorrect. The error bars and the consistent trend of higher accuracy with a properly implemented deep learning model, particularly with smaller sample sizes, is devastating to Leek's original post.

I think this is a fantastic example of the speed and self-correcting nature of science in the internet-age.

As an aside, @simplystats blocked me on Twitter, which I assume is in response to this tweet: https://twitter.com/ErickRScott/status/871586233599893505 and it seems that I'm likely not the only one blocked: https://twitter.com/jtleek/status/871693250947624961

What's most concerning about @simplystats blocking activity is the chilling effect it has on discourse between differing perspectives. I've tried to come up with a rationale for why highlighting the most recent evidence in reply to someone who sympathized with Leek's original post (btw, @thomasp85 liked the tweet) is grounds for blocking , but I can't come up with a reasonable idea.

Further aside, is irq11 Rafael Irizarry?

Update: after emailing the members of @simplystats they have removed the block on my account and offered a reasonable explanation. SimplyStats is a force for good in the world (https://simplystatistics.org/courses/) and I look forward to their future contributions.

[+] ska|8 years ago|reply

It's an interesting conversation but really weakened by failing to take on the generalization problem head on. This is something I see in a lot of discussions about deep nets on smaller data sets, whether transfer or not. The answer "it's built in" is particularly unsatisfying.

The plots shown certainly should raise the spectre of overtraining - and rather than handwaving about techniques to avoid it, it would be great to see a detailed discussion of how you convince yourself (i.e. with additional data) that you are reasonably generalizable. Deep learning techniques are no panacea here.

[+] m3kw9|8 years ago|reply

Ppl keep saying a lot without even thinking is a relative term. For images a lot means enough to get to x percentage accuracy, for OCR for a single font, a lot means 26 letters + special chars and numbers. Stop saying a lot blindly like every one underatands.

[+] irq11|8 years ago|reply

...but why would you?

The fact that there are people "getting their jimmies up" on questions of training massively paramterized statistical models on tiny amounts of data should tell you exactly where we are on the deep-learning hype cycle. For a while there, SVMs were the thing, but now the True Faithful have moved on to neural networks.

The argument this writer is making is essentially: "yes, there are lots of free parameters to train, and that means that using it with small data is a bad idea in general, but neural networks have overfitting tools now and they're flexible so you should use them with small data anyway." This is literally the story told by the bulleted points.

Neural networks are a tool. Don't use the tool if it isn't appropriate to your work. Maybe you can find a way to hammer a nail with a blowtorch, but it's still a bad idea.

[+] tsiki|8 years ago|reply

I think you're missing the point. The jimmies are getting rustled up because someone provides false information about the performance to make his own argument seem better. This is something anyone should be against.

[+] RandomInteger4|8 years ago|reply

It might be that people want to learn to use deep learning, but they find the NIST AND other sets boring and want to learn on a problem they find interesting, but still produce something that seems to work.

Plus, learning to learn how to learn on less can only help the field of learning. That's the goal of one shot learning right?

[+] simcop2387|8 years ago|reply

Small question here, what are SVNs in this context?

[+] rahimnathwani|8 years ago|reply

s/SVN/SVM

[+] j7ake|8 years ago|reply

Of course you can use it but does it perform better than "shallow" methods such as Gaussian processes, SVMs, and multivariate linear regression ? either through theoretical or empirical evidence ?

[+] minimaxir|8 years ago|reply

The original post used a linear regression (and apparently misimplemented an intended logistic regression); this post sees better results at all sample sizes with a proper deep learning approach.

[+] zensavona|8 years ago|reply

Maybe what you should do is deep learn some data and then do some deep learning with your deep learned [deep] data.

Deep.

[+] known|8 years ago|reply

aka Wisdom of Crowds

[+] deepnotderp|8 years ago|reply

Also, not to mention transfer learning is a big one.

Case in point, the silicon valley "not hotdog" classifier which they stopped at hotdog or not due to lack of training data when in reality they could've just used a pre trained net on imagenet. Lol, I was literally cringing through that episode so hard xD

[+] minimaxir|8 years ago|reply

The developer behind the real-world app mentioned why they did not use pretraining, and why you can't always use pretraining: https://news.ycombinator.com/item?id=14347513

> We ended up with a custom architecture trained from scratch due to runtime constraints more so than accuracy reasons (the inference runs on phones, so we have to be efficient with CPU + memory), but that model also ended up being the most accurate model we could build in the time we had. (With more time/resources I have no doubt I could have achieved better accuracy with a heavier model!)

[+] aub3bhat|8 years ago|reply

This debate is meaningless for several reasons:

1. The original argument is a strawman. What do they mean by "data"? Is it survey results, microarrays, "Natural" images, "Natural" language text or readings from an audio sensor? No ML researcher would argue that applying complex models such as CNNs is useful for say survey data. But if the data is domain specific, such as Natural Language text, images taken in particular context, etc. using a model and parameters that exhibit good performance is a good starting point.

2. Unlike how statisticians view data (as say a matrix of measurements or "Data Frame"), machine learning researchers view data at a higher level of representation. E.g. An image is not merely a matrix but rather an object that can be augmented by horizontally flipping, changing contrast etc. In case of text you can render characters using different fonts, colors etc.

3. Finally the example used in the initial blog post, of predicting 1 vs 0 from images is itself incorrect. Sure a statistician would "train" a linear model to predict 1 vs 0, however I as an ML researcher would NOT train any model at all and would just use [1] which has state of the art performance in character recognition in widely varying conditions. When you have only 80 images, why risk assuming that they are sampled in an IID manner from population, instead why not simply use a model thats trained on far larger population.

Now the final argument might look suspicious but its crucial in understanding the difference between AI/ML/CV vs Statistics. In AI/ML/CV the assumption is that there are higher level problems (Character recognition, Object recognition, Scene understanding, Audio recognition) which when solved enable us to apply them in wide variety of situations where they appear. Thus when you encounter a problem like digit recognition the answer an ML researcher would give is to use a state of the art model.

[1] https://github.com/bgshih/crnn

[+] Sean1708|8 years ago|reply

Are you using some accepted definition of Statistics that I'm unaware of? Because I always thought Machine Learning was a branch of Statistics.

78 comments