If you're new to ML or data science, I would recommend building a strong basis in Bayesian statistics. It will help you understand how all of the "canonical" ML methods relate to one another, and will give you a foundation for building on them.
In particular, aspire to learn probabilistic graphical models and the libraries used to train them (like Pyro, TensorFlow Probability, Edward, or Stan). They have a steep learning curve, especially if you're new to the game, but the reward is great.
All of these methods have their place. SVMs have their place too, but they aren't great for probability calibration, and non-linear SVMs, like every kernel method, can scale absolutely terribly. Neural networks have their place: sometimes as a component of a larger statistical model, sometimes as a feature selector, sometimes as the model itself. They're also very often the wrong choice for a problem.
Don't fall into the beginner trap: people tend to mistake 'what is the hottest research topic' for 'what is the right solution to my problem given my constraints (data limitations, time limitations, skill limitations, etc.)'. Be realistic, don't indulge in magical thinking, and build a strong basis in statistics to weed out the beautiful non-bullshit from the bullshit that is frustratingly prevalent (everyone and their mother is an ML expert today).
EDIT: I want to also clarify: I don't mean to suggest the author is new to ML, I just mean this as general advice for anyone coming here who is new to DS/ML. The article looks great!
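To make the Bayesian flavor of the parent advice concrete before reaching for a PGM library, here's a minimal sketch in plain Python. The function name, prior, and numbers are my own illustration, not from any library mentioned above:

```python
# Beta(a, b) prior over a coin's heads-probability.
# After observing k heads in n flips, conjugacy gives the
# posterior in closed form: Beta(a + k, b + n - k).
def beta_binomial_update(a, b, k, n):
    return a + k, b + (n - k)

# Uniform prior Beta(1, 1), then observe 7 heads in 10 flips.
post_a, post_b = beta_binomial_update(1, 1, 7, 10)
posterior_mean = post_a / (post_a + post_b)  # 8 / 12, i.e. 2/3
```

The PGM libraries automate exactly this kind of prior-to-posterior bookkeeping for models far too large to solve in closed form.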
Personally, I'd advise against both SVMs and Bayesian methods for a beginner. Bayesian statistics is very much the deep end of the pool. Graphical models and Bayesian methods generally may make a comeback, but such approaches have been superseded by other methods for good reasons, i.e. scaling.
A strong basis in statistics is certainly a great thing, but that can be as little as maximum likelihood plus Bayes' law (i.e. "MAP" estimation, which is more of a Bayesian hack on top of maximum likelihood than an actual Bayesian method), and it provides the big picture for almost everything.
Meanwhile a strong basis in "deterministic methods", as an alternative way to spend that learning effort, has its own rewards. The training algorithms for deep learning are also the hottest algorithm research area in machine learning, and are certainly applicable beyond deep learning. For that matter a thorough understanding of SVM delves into convex optimization, an extremely powerful framework as well.
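The MAP-as-penalized-maximum-likelihood point can be written out in one line (a standard identity, not specific to any method above):

```latex
\hat{\theta}_{\mathrm{MAP}}
  = \arg\max_{\theta} \, p(\theta \mid D)
  = \arg\max_{\theta} \left[ \log p(D \mid \theta) + \log p(\theta) \right]
```

With a flat prior the second term vanishes and MAP reduces to maximum likelihood; with a Gaussian prior on $\theta$, $\log p(\theta)$ becomes an L2 penalty, which is why ridge-style regularization keeps reappearing across otherwise "deterministic" methods.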
Different people learn in different ways, but personally I've had more success with the opposite approach, i.e. "top-down".
As in, rather than learning all the low-level parts in depth and only putting them together at the end, start with a surface-level understanding of a working prototype, then expand into the details of how everything works inside.
In the case of ML, this could mean starting with a five-line scikit-learn prototype of a random forest model, seeing some working predictions, then expanding knowledge from there: what data is going in and what is coming out? What's a classifier? What's a decision tree? Etc.
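A sketch of what that five-line starting point might look like (the dataset choice is mine, purely for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Load a small labeled dataset and hold out a test split.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Fit a random forest and check held-out accuracy.
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
print(model.score(X_test, y_test))
```

Every line here is a thread to pull on: what `load_iris` returns, why we split, what a forest of trees is actually voting on.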
Starting with PGMs would kill 99.9% of aspiring ML practitioners. Classes related to PGMs at Stanford and MIT are considered some of the most difficult ones. I'd rather recommend starting with something they are enthusiastic about and, once they become sufficiently advanced, naturally learning (H)PGMs.
Great comments. I heartily agree and support the statement about probabilistic graphical models. Just to add a couple more facets to this perspective:
'State of the art' does not always mean 'best for your task'; in fact, lately, depending on your field, SOTA sometimes simply means 'unaffordable' for anyone whose budget is under a million dollars.
Try linear methods first.
Ensembles of decent models are usually good models. The point above about probability calibration can be at least somewhat mitigated by using ensemble averages.
Don't just assume "the $MODEL will figure it out" if you give it shitloads of degrees of freedom. Machine learning efficiency all comes down to efficiency of representation, and feature engineering can achieve huge payoffs if/when you incorporate domain knowledge and expertise.
Once you gain a perspective into the "universality" of statistical methods, optimization, and Bayesian probability theory, your work will become a lot easier to reason about. As an example, try to see if you can explain why a least-squares fit results from the assumption that model residuals are normally distributed (and what connections this may have to statistical physics!).
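For that least-squares exercise, the key step is writing down the log-likelihood under Gaussian residuals (a standard derivation, sketched here for reference):

```latex
% Model: y_i = f(x_i; \theta) + \epsilon_i, with \epsilon_i \sim N(0, \sigma^2) i.i.d.
\log p(y \mid \theta)
  = -\frac{n}{2}\log(2\pi\sigma^2)
    - \frac{1}{2\sigma^2} \sum_{i=1}^{n} \bigl( y_i - f(x_i; \theta) \bigr)^2
```

The first term is constant in $\theta$, so maximizing the likelihood is exactly minimizing the sum of squared residuals. The statistical-physics connection: the likelihood has the form of a Boltzmann factor $e^{-E/T}$ with a quadratic energy, with $\sigma^2$ playing the role of temperature.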
Thanks for this insight. Can you kindly also suggest a good book for someone to start with Bayesian statistics? I could really use suggestions for a first and second book on this.
About probabilistic graphical models, is there a book other than Daphne Koller's that you would suggest?
Requesting the best book(s) on probability estimation: techniques, model accuracy, and strategies for applying them (e.g. in markets, marketing, business operations)?
Do you have any learning resources to recommend for Bayesian ML? Especially interested in more applied stuff, and ideally temporal and spatial modeling.
Thanks for the advice. Will definitely try to follow that. I was trying to learn the basics of statistics and went through most of Introduction to Statistical Learning. Will complete the rest in a few days.
I am more of a book person; if you have any other resources for probabilistic graphical models, please share them here.
Stay away, in my opinion. I spent a year supporting an SVM in a production machine learning application, and it made me wish the ML research community hadn't been so in love with them for so long.
They're the perfect blend of theoretically elegant and practically impractical. Training scales as O(n^3), serialized models are heavyweight, prediction is slow. They're like Gaussian Processes, except warped and without any principled way of choosing the kernel function. Applying them to structured data (mix of categorical & continuous features, missing values) is difficult. The hyperparameters are non-intuitive and tuning them is a black art.
GBMs/Random forests are a better default choice, and far more performant. Even simpler than that, linear models and generalized linear models are my go-to most of the time. And if you genuinely need the extra predictive power, deep learning seems like better bang for your buck right now. Fast.ai is a good resource if that's interesting to you.
Linear models are simpler. GBMs are more powerful, more flexible, and faster.
Every ML course I took had 3 weeks of problem sets on VC dimension and convex quadratic optimization in Lagrangian dual-space, while decision tree ensembles were lucky to get a mention. Meanwhile GBMs continue to win almost all the competitions where neural nets don't dominate.
I suspect my professors just preferred the nice theoretical motivation and fancy math.
Kernel function choice is simple: are you in a high-dimensional space? If so, choose a linear kernel. Otherwise, choose the most non-linear one you can (usually a Gaussian/RBF). I suppose quadratic and the other kernels are useful if what you're modeling looks like that, but in practice that is rare.
Prediction is not that slow with linear SVMs, especially not compared to something like k-NN. The main hyperparameters that matter are the "C" value and maybe class weights if you have recall or precision requirements. The C value should be grid-searched, but you might as well grid-search everything that matters in any ML algorithm, and in this regard SVMs are fast to iterate on (because the C value is all that matters).
Handling categorical and continuous features is not difficult if you choose to do it in anything more sophisticated than sklearn. Also, pd.get_dummies() exists (though it may lead to that slow prediction you're concerned about).
You're most likely right about GBMs or random forests, though they can have all sorts of issues with parallelism if you're not on the right kind of system. You talk about linear models, but SVMs usually use linear kernels anyway and are a generalization of linear models (including lasso and ridge regression).
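The grid search over C described above is only a few lines in scikit-learn; the dataset and C grid here are my own choices for illustration:

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC

X, y = load_digits(return_X_y=True)

# C is effectively the only knob that matters for a linear SVM,
# so sweep it on a log grid with cross-validation.
grid = GridSearchCV(LinearSVC(max_iter=20000),
                    param_grid={"C": [0.001, 0.01, 0.1, 1, 10]},
                    cv=5)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```

Because the grid is one-dimensional, the whole sweep stays cheap, which is the "fast to iterate" point being made.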
I'd agree on the training time, but your serialized model should be small on disk, since only the support vectors are needed for inference. At least in my experience that has been true.
ITT: Whether SVMs are still relevant in the deep learning era. Some junior researchers will say neural networks are all you need. Industry folks will talk about how they still use decision trees.
Personally, I'm quite bullish on the resurgence of SVMs as SOTA. What did it for me was Mikhail Belkin's talk at IAS.[1]

[1] https://m.youtube.com/watch?index=15&list=PLdDZb3TwJPZ5dqqg_...
I mean, NNs are still quite bad at low-n tabular data (and they may always be), which is honestly what a lot of real-life data looks like, so there is clearly a need for something that is not a neural network.
I feel like I've seen more tree ensembles in the wild than SVMs, though.
I've been an ML practitioner since 2009. I've used every method imaginable or popular, I think, with the exception of non-linear SVMs. Linear SVM => all good, just the hinge-loss optimization. Non-linear SVM => a bit of overkill with basis expansion. Just too slow, or too complex a model?
My impression: SVMs are more of theoretical interest than practical interest. Yeah, learn your statistics. Loss functions. Additive models. Neural nets. Linear models. Decision trees, kNNs etc. SVM is more of a special interest, imho.
We can definitely learn a thing or two from such an experienced practitioner. Thanks for sharing; I think your intuition matches that of the other experienced folks in the comments.
Want to whet my appetite for your suggestion.
If someone has a suggestion on how I can improve the user experience feel free to hop in and let me know.