
What it takes to build great machine learning products

212 points | yarapavan | 14 years ago | radar.oreilly.com

47 comments

[+] gfodor|14 years ago|reply
A great and insightful article. A common theme I've seen in practice is folks who have a deep understanding of ML often run straight to applying the most sophisticated algorithms possible on raw data. On the other hand, people who know a bit about ML but understand the domain better start by applying intuition to data cleansing and then follow up with simpler algorithms. Without fail the latter group ends up with better results.
[+] jmount|14 years ago|reply
My definition of a "deep understanding of ML" definitely excludes people who immediately try "the most sophisticated algorithm." Buzzword jockeys try all the cool stuff first, but most practitioners I have met try basic statistical methods first. Then, when they see what issue they need to overcome, they bring in a method designed to help with that issue.
[+] ma2rten|14 years ago|reply
Really? I'd think that most ML people are well aware of the importance of data cleansing and feature extraction. Also my experience is that domain knowledge often (but not always - depends on the domain) helps surprisingly little. Feature extraction is mostly an iterative approach anyway: you define some very simple features, you look at the mistakes, you add some features and repeat until you are happy. Ideally you also do some visualization in there somewhere.
[+] tel|14 years ago|reply
I think there are essentially two "deep" understandings of ML prevalent today. The first is more common: the ability to do the calculus, algebra, and probability derivations required to design complex ML algorithms combined with the CS knowledge to find/design a good algorithm and the software design skill to actually implement it on real, "big" data.

No doubt this is a difficult position to master, and those who perform well are able to tackle lots of mathematical and computational challenges. They are also model builders who tend to relentlessly seek complex models in order to solve complex problems.

The other, rarer side is the learning theorist who may or may not understand the model building, algorithmic, and computational tools but understands well the theories which allow us to have reasonable expectations that the tools of the first group will work at all. These guys have a funny story in that they were the old statisticians who got major egg on the face after proclaiming that essentially all of ML was impossible. Turns out the first group managed to redefine the problem slightly and make major headway (and money).

---

The thing I want to bring to light, however, is that the second group knows the math that bounds the capacities of ML algorithms. This isn't easy. It's one thing to say you recognize that the curse of dimensionality exists, but it's another to have felt its mathematical curves and to build an intuition for what forces are sufficient to cause disruption.

The more experience you have with the learning maths, the more likely you are, I feel, to apply very simple algorithms, to be scared of "little x's" (real data) enough to treat it with great care, and to explore the problem space with a clear sense of which steps will lead you to folly.

---

It's a fine line between the two, though. Stray too far to the first group and you'll spend a month building an algorithm that does a millionth of a percentage point better than Fisher's LDA. Spend too much time in the second camp and you'll confidently state that no algorithm exists that does better than a millionth of a percentage point over Fisher LDA... and then lose purely by never trying.

[+] irahul|14 years ago|reply
> On the other hand, people who know a bit about ML but understand the domain better start by applying intuition to data cleansing and then follow up with simpler algorithms.

I find data cleansing (if you are including feature selection) hard, and I consider it a refinement. If I am working on a classification problem, I start with naive Bayes with a trivial feature generator (if words are features, split on whitespace and discard some symbols), train it, and cross-validate. Depending on the results of the cross-validation on differently sized data sets (say 100 tweets, 200, 500, 1000, 2000, 5000), I decide whether to refine Bayes further or pick another algorithm.

I avoid SVMs because I have a hard time figuring out the kernel and the relations in the data. I mostly don't use linear classifiers because the relation is very rarely linear.

Generally, if the features are pseudo-independent (naive Bayes assumes independent events, but it might work fine even if the events aren't independent), naive Bayes does the job. If not, it's time to refine the feature generator and selector.

[+] joe_the_user|14 years ago|reply
That has been my experience with AI programming.

But I would take a more pessimistic interpretation of this.

That is: all our "learning algorithms" have failed to learn, and those with some clever heuristics succeed in spite of the broken methods we have so far.

[+] mturmon|14 years ago|reply
Agreed. Another ingredient is sustained engagement with the problem, so that your algorithm works not just for a pre-selected demo, but actually provides noticeable performance gains for real data.
[+] tgflynn|14 years ago|reply
I agree that the big wins in machine learning/(weak)AI are probably going to come more from figuring out how to better apply existing models and algorithms to real problems rather than from improving the performance of the algorithms themselves.

That said, one shouldn't underestimate the amount of commonality between problems that to some people may appear unrelated. For example, this post talks about the gains in machine translation performance from including larger contexts. The same principle applies to many other sequence learning problems: with handwriting recognition, for instance, it is often not possible (even for a human) to determine the correct letter classification for a given handwritten character without seeing it within the context of the word.

[+] chaostheory|14 years ago|reply
The article is light on details. imo there are two major things your team needs:

1) Programmers that have the needed math skills, or mathematicians with the needed coding skills

2) A distributed ML framework

Solving problem one is not easy but it's straightforward.

Solving problem two is harder. While there are a lot of open source machine learning projects, almost all of them seem focused on being used by a person rather than called from a program. Moreover, very few do distributed processing except for Mahout (http://mahout.apache.org/). Mahout is promising, but the documentation is still thin, and I'm not sure if it's gaining momentum in terms of mind share yet.

[+] suneilp|14 years ago|reply
What kind of math skills? What would a programmer need to learn in order to work on ML stuff?
[+] ma2rten|14 years ago|reply
Right now NLP is mostly limited to niche applications, like sentiment analysis and clever products built around it. I actually think the reason is that both natural language processing and machine learning are still in their early days.

Imagine all the applications for consumer products if algorithms were really able to understand language (as far as you can understand something if you are a computer program and not a sentient human being), for example if we were able to do real text summarization.

I believe this is not only possible, but not as far away as people think. However, to reach that goal we need to let go of the idea that NLP is mostly about clever feature engineering, and instead start building algorithms that derive those features themselves. Part of the problem is how evaluation is set up in NLP. What the best algorithm is, is decided based on who gets the best performance on some dataset. This sounds all nice and objective, but you will always be able to get the best performance if you try enough combinations of features (overfitting the test set) [1]. These small improvements say little about real world performance.

For the NLP people among you, this is an interesting paper that tries to do a lot of things differently: http://ronan.collobert.com/pub/matos/2008_nlp_icml.pdf

This is the corresponding tutorial, which is quite entertaining as well: http://videolectures.net/nips09_collobert_weston_dlnl/

[1] I think this is less true for machine translation, where there are more and bigger test sets and less feature engineering going on.
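The test-set-overfitting point above is easy to see in a toy simulation: if you evaluate enough candidate features against one fixed test set, the best one looks well above chance even when both the features and the labels are pure noise. All numbers here are invented for illustration:

```python
import random

rng = random.Random(42)
n_examples, n_candidates = 100, 500

# Labels and candidate binary "features" are pure noise: none has real signal.
labels = [rng.randint(0, 1) for _ in range(n_examples)]
candidates = [[rng.randint(0, 1) for _ in range(n_examples)]
              for _ in range(n_candidates)]

def accuracy(feature, labels):
    # Use the feature itself as a one-feature classifier, allowing a sign flip.
    agree = sum(f == y for f, y in zip(feature, labels)) / len(labels)
    return max(agree, 1.0 - agree)

scores = [accuracy(f, labels) for f in candidates]
best = max(scores)       # looks impressive despite zero real signal
mean = sum(scores) / len(scores)
```

The mean accuracy hovers near chance, but the maximum over 500 tries looks like a genuine improvement, which is exactly what repeated leaderboard-style evaluation on a single dataset rewards.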

[+] brendano|14 years ago|reply
Careful with the Collobert ICML-2008 paper. It has a very negative reputation among NLP researchers who actually know the area, just for its setup/evaluation. If you're interested in the methods (which I think are interesting), that group's later work is much improved.
[+] ogrisel|14 years ago|reply
Very nice article Aria. You quickly mention Pegasos as a scalable alternative to SMO. I agree that this works well for linear models. But despite the claim that Pegasos can be trivially adapted to kernel models, I have never seen any implementation of a kernel Pegasos, and I don't understand how it's even possible. Have you used Pegasos-style algorithms to fit non-linear models?

On the other hand there exist alternatives such as LaSVM that can effectively scale linearly to large datasets (but the optimizer works in dual representation as with SMO and not like Pegasos).
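For context, the linear case being discussed is simple to sketch. This is a toy implementation of the Pegasos primal SGD step (step size 1/(λt), shrink-then-add on margin violations; the optional projection step is omitted), with made-up data, not production code:

```python
import random

def pegasos(data, lam=0.1, n_iters=2000, seed=0):
    # Linear Pegasos: SGD on the primal SVM objective.
    # data: list of (x, y) with x a list of floats and y in {-1, +1}.
    rng = random.Random(seed)
    dim = len(data[0][0])
    w = [0.0] * dim
    for t in range(1, n_iters + 1):
        x, y = data[rng.randrange(len(data))]
        eta = 1.0 / (lam * t)  # Pegasos step-size schedule
        margin = y * sum(wi * xi for wi, xi in zip(w, x))
        # Shrink w (regularization), then add the example if it
        # violates the margin.
        w = [(1 - eta * lam) * wi for wi in w]
        if margin < 1:
            w = [wi + eta * y * xi for wi, xi in zip(w, x)]
    return w

# Toy linearly separable data: the label is the sign of the first coordinate.
train = [([1.0, 0.2], 1), ([0.8, -0.1], 1),
         ([-1.0, 0.1], -1), ([-0.9, -0.3], -1)]
w = pegasos(train)
preds = [1 if sum(wi * xi for wi, xi in zip(w, x)) >= 0 else -1
         for x, _ in train]
```

The difficulty with kernelizing this is visible in the update: the linear case stores `w` explicitly, whereas a kernel version must represent `w` implicitly as a growing sum of kernel evaluations against past examples.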

[+] psb217|14 years ago|reply
You may want to look at the paper: "P-packSVM: Parallel Primal grAdient desCent Kernel SVM" from ICDM 2009. It presents an extension of Pegasos to non-linear kernels. Evaluating the pairwise kernels <x_i,x_j> and continuously updating the estimate of the norm of the implicit weight vector w seem to be the main hurdles to achieving the performance gains seen with linear kernels.

The key takeaway from the paper (for me) was that the computation time on a single processor was not significantly better than that of the standard implementation provided by SVM-Light. However, with a variety of tricks permitted by the use of an SGD/Pegasos-like method, the authors were able to get significant speedup when using a compute cluster, allowing a good reduction in computation times (e.g. ~200x reduction on 512 processors).

[+] brendano|14 years ago|reply
For NLP applications, which I think Aria's article is mostly concerned with, non-linear kernelized classifiers are often little better than linear ones. I think that's one part of the recent interest in SGD-style training algorithms (they work for linear cases nicely, less so for kernelized ones).

[deleted part about kernelizing pegasos, realized i dont know that area]

[+] srconstantin|14 years ago|reply
Link to LaSVM paper: jmlr.csail.mit.edu/papers/volume6/bordes05a/bordes05a.pdf Also a good overview of SVM techniques in general.
[+] srconstantin|14 years ago|reply
So...are you saying you need the dual formulation in order to allow a kernel model?
[+] TimPC|14 years ago|reply
It's a very exciting time. I'm incredibly excited to see what goes on here. I previously explored an online education start-up idea and I'm really looking forward to seeing Ng and Koller change the world. I'm also very excited to see machine learning on the radar. For me one of the biggest challenges is often making AI intuitive. As machine learning becomes more mainstream it will be on people's design radar, and that will make it easier to turn great algorithms into great products.
[+] 3pt14159|14 years ago|reply
Partially, although in my experience over the past 4 years doing this stuff 1 hour cleaning the input data gets you thrice the output of 1 hour tuning the algos. Some algorithms are more sensitive than others, but in general, garbage in, garbage out.
[+] mailshanx|14 years ago|reply
I think this is pretty accurate. Here is an example from my own thesis research: I'm using machine learning to tune an (underwater) communication link, i.e. decide what modulation / error coding algorithms/parameters will yield good data rates in a dynamic channel.

At first I tried using an off-the-shelf classifier to figure out which parameters would work well. That failed because by the time I had sampled a decent proportion of the possible parameter values, the channel would change (the number of possible combinations is on the order of a few million).

It turned out that the real problem is not learning the performance of the available parameters; rather, it lies in "learning how to learn": my ML system needs to adaptively search the space, responding to the history of previous explorations and their outcomes. This kind of exploration would be effective only with an understanding of how the underlying modulation/coding algorithms work and interact with each other.
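The comment doesn't give code, but one hypothetical way to sketch that kind of adaptive search is a bandit-style loop: instead of sweeping every parameter combination, keep a running estimate of each configuration's rate and spend most trials on the current best. Everything below (config ids, rates, the noise model) is invented for illustration and is not the commenter's actual system:

```python
import random

def epsilon_greedy_tuner(configs, measure_rate, n_trials=500, eps=0.1, seed=0):
    # Epsilon-greedy bandit: mostly exploit the config with the best running
    # mean rate, but keep exploring so the estimates stay honest.
    rng = random.Random(seed)
    counts = {c: 0 for c in configs}
    means = {c: 0.0 for c in configs}
    for _ in range(n_trials):
        if rng.random() < eps:
            c = rng.choice(configs)                    # explore
        else:
            c = max(configs, key=lambda k: means[k])   # exploit
        r = measure_rate(c)                            # noisy observed rate
        counts[c] += 1
        means[c] += (r - means[c]) / counts[c]         # incremental mean
    return max(configs, key=lambda k: means[k])

# Hypothetical channel: three configs with unknown true rates; config 2 is best.
true_rates = {0: 1.0, 1: 2.0, 2: 3.0}
noise = random.Random(1)
best = epsilon_greedy_tuner(list(true_rates),
                            lambda c: true_rates[c] + noise.gauss(0, 0.5))
```

A real version would also have to handle the channel drifting over time, e.g. by discounting old observations, which is exactly where the domain knowledge the comment mentions comes in.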

[+] microtonal|14 years ago|reply
Indeed. From our own experience: we use pretty much off-the-shelf maximum entropy parameter estimators for parse disambiguation and fluency ranking. In the past ~10 years most of the gain has come from smart feature engineering using linguistic insights, analyzing common classes of classification errors, etc. Beyond l1 or l2 regularization, the use of (even) more sophisticated machine learning algorithms/techniques has not yet given much, if any, improvement for these tasks in our system.

What did help in understanding models is the application of newer feature selection techniques that give a ranked list of features, such as grafting.

[+] seamusabshere|14 years ago|reply
My for-profit company (Brighter Planet) often gets product ideas from our data scientists; it's exactly what Dr. Haghighi is talking about.

For example: trying to model environmental impact of Bill Gates's 66,000 sq ft house during a hackathon -> discovery that we need fuzzy set analysis (https://github.com/seamusabshere/fuzzy_infer) -> new, marketable capabilities in our hotel modelling product (https://github.com/brighterplanet/lodging/blob/master/lib/lo...).

[+] salimmadjd|14 years ago|reply
I have enjoyed the author's other posts via his Prismatic blog here. It's one of the most interesting blogs to follow, with only a few posts so far. However, this article falls a bit short. It feels rushed out, which is understandable.

I think it would have been better if this was just the first part of a multi-article write up on ML. With this one being an intro and follow-ups on specific approaches.

[+] junktest|14 years ago|reply
Probably try PCA (principal component analysis) first, to help select the most important features of the data, before going further in modeling it.
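As a concrete illustration of the suggestion above (restricted to 2-D so the eigendecomposition has a closed form; the data is synthetic), PCA finds the direction of maximum variance, and the fraction of variance that direction explains tells you how much structure a single component captures:

```python
import math

def pca_2d(points):
    # PCA for 2-D data via the closed-form eigendecomposition of the
    # 2x2 covariance matrix; returns (principal_axis, fraction_of_variance).
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points) / n
    syy = sum((y - my) ** 2 for _, y in points) / n
    sxy = sum((x - mx) * (y - my) for x, y in points) / n
    # Largest eigenvalue of [[sxx, sxy], [sxy, syy]].
    tr, det = sxx + syy, sxx * syy - sxy ** 2
    lam = tr / 2 + math.sqrt(max(tr * tr / 4 - det, 0.0))
    # Matching eigenvector: (lam - syy, sxy), unless the matrix is diagonal.
    if abs(sxy) > 1e-12:
        vx, vy = lam - syy, sxy
    else:
        vx, vy = (1.0, 0.0) if sxx >= syy else (0.0, 1.0)
    norm = math.hypot(vx, vy)
    return (vx / norm, vy / norm), (lam / tr if tr else 0.0)

# Synthetic points spread mainly along the diagonal y = x.
pts = [(i + 0.1 * (-1) ** i, float(i)) for i in range(10)]
axis, explained = pca_2d(pts)
```

One caveat: PCA gives you linear combinations of the original features rather than a subset of them, so it is dimensionality reduction more than feature selection in the strict sense.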
[+] marshallp|14 years ago|reply
The article doesn't mention two important things (and instead focuses on being clever - the opposite of what machine learning stands for). First, the deep learning algorithms that automatically create features. Second, the importance of gathering lots of data, or generating it.

If you have to be really clever with feature engineering, then what's the point of even calling yourself a machine learning person?

[+] ogrisel|14 years ago|reply
I agree that deep learning is an interesting approach to learning higher-level features. However, it's still a long way from being a universal solution: for instance, deep learning won't help you solve the machine translation or multi-document text summarization problems automagically: you still need to find good (hence often task-dependent) representations for both your input data and the data structure you are trying to learn a predictive model for.
[+] grinalds|14 years ago|reply
Deep learning is an interesting approach - although the features that DL algorithms decide are most important are not always intuitive or weighted properly in context. Partial-feature engineering is sometimes the only way to effectively deal with biases, especially in higher-dimensional space where the DL features can be very opaque.