Machine Learning 101: What Is Regularization?

[+] Muted|9 years ago|reply

If you are interested in learning the basics of Machine Learning I really recommend Andrew Ng's course on Coursera[0]. It starts off very basic and requires almost no prior knowledge other than some linear algebra. It evolves to more advance topics like Neural Networks, Support Vector Machines, Recommender Systems, etc. It is amazing how much you can do after just a few weeks of lessons. It's a very practical course with real world examples and exercises each week. The focus is mostly on using and implementing machine learning algorithms and less so on how and why the math behind them works. (To me this is a negative, but I'm sure some people will appreciate that.)

[0] https://www.coursera.org/learn/machine-learning

[+] kmiroslav|9 years ago|reply

Not a very well structured article. The animated graphic (which is cool) comes before the lambda parameter is even introduced. That animation should have been put at the end.

It's also not doing a good job at explaining when regularization should be used. The dot cloud in the article is self serving because these points are roughly aligned but they also have a vague cubic form, which explains why the cubic model works better.

The hard part is knowing when you should be using regularization, and that graph doesn't help because it doesn't show further points, therefore not explaining why a certain regression is better than another.

[+] unknown|9 years ago|reply

[deleted]

[+] achompas|9 years ago|reply

I actually wrote a blog post on regularization a few months ago. I think it covers some of the issues, like overfitting, glossed over in this post:

http://acompa.net/on-regularization.html

[+] yelnatz|9 years ago|reply

Do you have any sources that expands on how that LASSO optimization problem was graphed?

My thoughts when I looked at it: Where did beta come from? Whats the beta hat supposed to represent? Where'd the contours come from? How come they stopped there? Oh they're supposed to represent the weights? How do they represent the weights? Why is LASSO diamond, the other circular?

I remember seeing that graph a while ago, didn't understand it then, still don't understand it now.

[+] XCSme|9 years ago|reply

This post is definitely not an 101. It says "Move the cursor below to change the value of λ :" but never tells who alpha is. Not newbie friendly.

[+] markovbling|9 years ago|reply

It's lambda - basically the amount of regularization (simplification or 'overfit penalty') to impose. Usually chosen by cross-validation - try every lambda between say 0 and 10 in 0.5 increments and choose lambda that gives the model the lowest error cross validation measure...

[+] mifix|9 years ago|reply

It actually explains what alpha is. The line above its first occurrence it says: "In the case of polynomials". So alpha are the coefficients of the polynomial.

[+] AstralStorm|9 years ago|reply

Why is it called machine learning while it is actually normal statistics and control theory?

There isn't anything in there that learns, there is only a data model and a strict algorithm.

It should rather be called Data Mining.

[+] RockyMcNuts|9 years ago|reply

There both about inference, just slightly different philosophical approaches. Statistics, dating before computers and big data, comes out of math, has a greater emphasis on proofs and closed form solutions, and IMO an opinionated view that what it's modeling is a well-behaved function plus a well-behaved error term.

Machine learning is kind of like statistics for street-fighting. It doesn't care about Bayesian v. frequentist philosophical debates, resorts to cheap tricks like regularization, dropout, using many dumb methods that work better than trying to find the smartest method (ensembles and funny tricks like random forest). If it works in a well-designed test, use it.

There is more than one path to enlightenment. One man's 'data mining' is another's reasoned empiricism, letting the data do the talking instead of leaning on ontological preconceptions of how data is supposed to behave.

Machine learning often works well but you don’t always know why. Stats doesn’t always work and you know precisely why not. (because your data doesn't fit the assumptions of OLS etc.)

[+] wodenokoto|9 years ago|reply

Fitting is called training and when you look at it this way, your algorithm gets better and better at some task the more you train it.thus with practice your algorithm learns. It's just semantics.

So how is it different from statistics. It isn't, but usually the focus in ml is on big data and the accuracy of the prediction, while statistics tend to focus more on the explanation of the data.

I believe the reason for the difference is one sprang out of computer science and revived research into AI and therefore is more hip. The other didn't and is still fighting with its image of being dry and boring.

[+] ma2rten|9 years ago|reply

You are right it's marketing for a large part. The term Machine Learning somehow implies that the algorithms are more intelligent (or even sentient) then they actually are.

That said, Machine Learning is different from statistics. Statistics are mostly concerned with modeling real world phenomena. Machine Learning is trying to build models that can take a training set and generalize to unseen data without building a specific model for that dataset.

[+] bane|9 years ago|reply

Machine Learning is the latest discipline to become a relabeled and conflated Statistics now that Data Science is old news. They used to be separate disciplines, but for some reason Stats keeps looking for a separate label to glom onto in a similar way that Ontologies used to glom onto every sexy new technology in sight.

In this light, most Machine Learning is just a rebadging of Classification theory that sounds cooler.

Somewhere in the future, ML will fall off the hype curve [1] and something new will come along for boring old disciplines to rebadge themselves as. The good news is that these disciplines all leave bits of themselves behind and the things they were pretending to be become better and better defined. It's amazing how few job posting for Data Scientist require a PhD a Stats these days, but it was all the rage for a year or two. Now a reasonable Data Scientist can train up on Corsera in a few weeks.

1 - http://4.bp.blogspot.com/-eL79PoJLFVY/UfSulEQrdfI/AAAAAAAAAw...

[+] unknown|9 years ago|reply

[deleted]

[+] markovbling|9 years ago|reply

There is learning - the model is learnt from the data

[+] ACow_Adonis|9 years ago|reply

An interesting thing about this article, potentially misleading to beginners trying to understand various machine learning and stats techniques, is that despite what the article says, it is not apparent at all that the polynomial model of degree 3 '"sticks" to the data but does not describe the underlying relationship of the data points'.

On the contrary...for this toy example, doesn't it look pretty good! There's really not enough information here to decide whether the model is actually over-fitting or not, and this can easily mislead the beginner into wondering "just why the hell are we taking that awesome model and doing some regularisation thingy to choose a worse model...which is then...better?"

To truly understand, you've got to tackle:

1. What is overfitting? 2. Why/when are too many parameters a problem?

Now...i don't know how intuitive this is for others, but I like to tell people that over-fitting is a fancy word for "projecting too much about the general from the specific".

So why does that matter and what does it have to do with too many parameters?

Well lets say you've got a sample of men and women, and in this case, you're trying to predict underlying rates of breast and testicular cancer (i'm assuming these are primarily gender related for my example), and the "real" relationship is indeed just gender: whether the person is male or female determines the basic underlying risks of these cancers. That's not very many variables for your model. But lets say, in your sample, several of the people with testicular cancer are named "Bob" and several of the people with breast cancer are named "Mary" so you add more variables, binary variables, which indicate whether a person is called "Bob", and whether a person is called "Mary", and suddenly your model prediction amongst your sample for cancer goes through the roof...and yet when you apply it to the population at large, not only did it not predict cancer any better...but suddenly there are all these angry letters from Bobs and Marys who were told they might have cancer. In fact, its doing worse than if you hadn't included those variables at all. What's going on? You overfit.

So you see, in many models, adding in more and more variables can lead you to do better in your sample, but at some point can actually make your model worse. Why does this happen?

Actually, amongst many machine learning and statistical algorithms, there's a pretty intuitive explanation...once its been explained to you.

Lets say that your model only had variables to indicate gender at first, and you come along and you throw in a handful more. You're judging your models performance on its prediction in your sample population. What could many machine learning algorithms do here? Well, for each new variable you introduce, one option is to do absolutely nothing. And if the algorithm chooses to do nothing, what you've actually got is your original gender indicators: you've gained nothing, but you've lost nothing (well, albeit adding more parameter numbers and algorithmic inefficiency/complexity). But most (almost all) methods are not that precise or accurate. So what else could happen? Well, each parameter you add has a small random and statistical chance of increasing your models predictiveness in your sample. We used the example of "Bob" and "Mary", but the people with cancer in your sample could have all sorts of qualities, and as you just through more variables/features at your algorithm, it will eventually hit some that, although having no explanatory power in the population at large, do correlate with statistical quirks of your sample. "Blue eyes", "four toes", "bad breath", "got a paycheck last week" that sort of things. Its a form of data-dredging and its far more widespread professionally than I'd like :P And if you keep throwing variables at it, eventually, many algorithms will choose to naively keep those characteristics that are overly specific to describing your sample, but don't describe the population at large.

And that's why we might want to "regularise". We want there to be a cost of adding variables to the model to make the phenomenon of including statistically spurious variables like this far more unlikely. It is hoped that strong generalisable variables, like male/female, will overcome this cost, while spurious ones added randomly or to game some metric will be less likely to pass that extra hurdle. To use a signal analogy, by implementing a cost for adding more variables, you're filtering out some of the statistical noise to get at the real-loud signal below.

Now, personal anecdote, even though you want to keep models simple (like code, ceteris paribus, and be suspicious of any machine learning/AI technique that uses too many variables...), I don't actually like regularisation often on the whole. In the example, its not actually clear at all that this is a case of over-fitting, and so by following it, you might actually be making your model worse by using it. And in the real world, there's often a number of other techniques that work better (test/train/resampling). But like all techniques, its another arrow in your quiver when the time is right.

And now I've written an essay.

[+] AstralStorm|9 years ago|reply

The main problem is that regularisation here is described in model fitting context. In actual machine learning context, it is usually called data resynthesis or data hallucination.

The main point to be taken is that input data is modified in some way to hide irrelevant detail. Regularisation does that by injecting specific kind of noise into data.

[+] eanzenberg|9 years ago|reply

Don't the fitting coefficients change with varying lambda? Ridge regression varies the fitted coefficients with varying lambda, and lasso can zero out coefficients which don't correlate with the response.

[+] zump|9 years ago|reply

What do you do when your (out of sample) test set is from a difference source from the training set, even though your training loss is low?

[+] hatmatrix|9 years ago|reply

Pray. Or build a new training set including a diverse samples so that you are less likely to enter the regime of extrapolation.

[+] alphaoverlord|9 years ago|reply

What is the difference between this and aic/bic?

[+] selectron|9 years ago|reply

Regularization is the process of modifying the model to limit overfitting, for instance by penalizing more complex models. AIC is a specific application of regularization.

[+] aayushnul|9 years ago|reply

too small and not that informative.. only the sliding one seems to be good.

[+] eggie5|9 years ago|reply

what does too many parameters mean? Too many features/dimensionality?

[+] eanzenberg|9 years ago|reply

It usually means too many knobs to tweak in the algorithm itself, the downside of course being overfitting.

[+] stdbrouw|9 years ago|reply

Yep. It can be hyperparameters ("knobs to tweak" as eanzenberg puts it) but regularization is also often used in a regression context where regularization either pulls parameter estimates for features towards 0 or just kicks out features altogether.

[+] unknown|9 years ago|reply

[deleted]

51 comments