chengtao
|
11 years ago
|
on: Machine Learning Done Wrong
As you pointed out, transforming features is powerful, and I believe that's exactly what makes SVMs powerful. Though the ways features can be combined with an SVM are limited, that limitation is what makes SVM training fast in the dual space.
On the other hand, if you want to compare logistic regression with SVMs, the details are pretty tricky. One simplified view is to compare a linear SVM, which is essentially hinge loss with L2 regularization, against logistic regression with L2 regularization, which is essentially negative binomial log-likelihood loss with L2 regularization. If you plot the two loss functions, it's easy to see how they penalize cases on the correct and wrong side of the margin differently.
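To make that comparison concrete, here is a minimal sketch (plain Python; the function names are my own) of the two losses as a function of the margin y·f(x):

```python
import math

def hinge_loss(margin):
    # Linear SVM loss: exactly zero once the margin exceeds 1,
    # linear penalty otherwise.
    return max(0.0, 1.0 - margin)

def logistic_loss(margin):
    # Logistic regression loss (negative log-likelihood):
    # never exactly zero, even for confidently correct predictions.
    return math.log(1.0 + math.exp(-margin))

# Compare how each penalizes points at various margins.
for m in [-2.0, 0.0, 1.0, 3.0]:
    print(f"margin={m:+.1f}  hinge={hinge_loss(m):.4f}  logistic={logistic_loss(m):.4f}")
```

The key difference shows up at large positive margins: hinge loss flattens to zero (only points near or past the margin matter, hence sparse support vectors), while logistic loss keeps assigning a small penalty everywhere.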
chengtao
|
11 years ago
|
on: Machine Learning Done Wrong
Great comment, +1.
chengtao
|
11 years ago
|
on: Machine Learning Done Wrong
Yes, and IMO, most of the time, the insight behind the data is far more important than the modeling algorithm for achieving high performance, with a few exceptions (say, computer vision, NLP, etc., which really require A LOT of data). The same holds even for some large data sets; take PageRank as an example. The fundamental insight was that the popularity of a site would be a great signal for ranking search results, and that a random walk would be a great way to approximate that popularity. As a result, Google achieved great success in search ranking.
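As a sketch of that insight, PageRank can be approximated by a few power-iteration steps over the link graph. This is a toy illustration, not Google's implementation; the graph, damping factor, and iteration count below are made up:

```python
def pagerank(links, damping=0.85, iterations=50):
    # Toy PageRank via power iteration: rank mass flows along outlinks,
    # with a damping factor modeling the "random surfer" teleporting.
    n = len(links)
    ranks = {page: 1.0 / n for page in links}
    for _ in range(iterations):
        new_ranks = {page: (1.0 - damping) / n for page in links}
        for page, outlinks in links.items():
            if outlinks:
                share = damping * ranks[page] / len(outlinks)
                for target in outlinks:
                    new_ranks[target] += share
            else:
                # Dangling page: spread its mass uniformly.
                for target in new_ranks:
                    new_ranks[target] += damping * ranks[page] / n
        ranks = new_ranks
    return ranks

# 'a' is linked to by both 'b' and 'c', so it ends up the most "popular".
toy_graph = {"a": ["b"], "b": ["a"], "c": ["a"]}
print(pagerank(toy_graph))
```

The stationary distribution of this random walk is exactly the popularity signal the comment refers to.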
chengtao
|
11 years ago
|
on: Machine Learning Done Wrong
I personally love the topic of Bayesian optimization over all the possible parameters, including model choice. My point was more that, given resources are always constrained, it typically pays off in the long term for practitioners to analyze the data and understand the underlying mechanics before jumping into modeling.
chengtao
|
11 years ago
|
on: Machine Learning Done Wrong
All the other comments are great. Just bear in mind that it's important to really understand the mechanics behind each importance measurement. Some use information gain, some use a t-test on the coefficients, while some use a random forest and check whether removing a feature makes a big impact, etc. They all make different assumptions, and the key point is, again, to understand whether those assumptions apply to your situation.
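For instance, information gain (one of the measurements mentioned) can be sketched like this — a toy implementation for discrete features and labels, with made-up data:

```python
import math
from collections import Counter

def entropy(labels):
    # Shannon entropy of a label list, in bits.
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(feature_values, labels):
    # Reduction in label entropy achieved by splitting on a discrete feature.
    n = len(labels)
    groups = {}
    for f, y in zip(feature_values, labels):
        groups.setdefault(f, []).append(y)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

# Toy data: the feature perfectly predicts the label, so gain == label entropy.
feature = [0, 0, 1, 1]
label = [0, 0, 1, 1]
print(information_gain(feature, label))  # 1.0
```

Note the assumption baked in here: information gain treats the feature in isolation and favors high-cardinality features, which is precisely the kind of assumption the comment says you should check against your situation.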
chengtao
|
11 years ago
|
on: Machine Learning Done Wrong
Great question. My main point is less about up-sampling the rare cases and more that the default loss function used in model training might not directly align with the final business metric (which is the metric practitioners should care more about). As a result, it's important to align the two. For some algorithms it's easy to incorporate a different loss function, while for others it might not be the case. Over- or under-sampling is one fairly generally applicable way to tweak the effective loss function.
While I'm not an expert on the theory behind sampling, if you do find the need to tweak sampling to align the default loss function with the business metric, I would suggest doing a grid search first and then validating the result against business insight. E.g., if you find that getting the rare cases right is much more important than getting the common cases right, does that align with the business insight?
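One concrete way to see the sampling/loss connection (a sketch with made-up weights and data, not from the post): up-weighting rare-class examples in the loss is equivalent to duplicating them, so over-sampling is really just a way to reshape the loss function:

```python
import math

def weighted_log_loss(y_true, p_pred, pos_weight=1.0):
    # Log loss where each positive (rare-class) example counts pos_weight times.
    total, weight_sum = 0.0, 0.0
    for y, p in zip(y_true, p_pred):
        w = pos_weight if y == 1 else 1.0
        total += -w * (y * math.log(p) + (1 - y) * math.log(1 - p))
        weight_sum += w
    return total / weight_sum

y = [0, 0, 0, 1]           # one rare positive among three negatives
p = [0.1, 0.2, 0.1, 0.3]   # model is unsure about the positive case

# Up-weighting the positive by 3 gives the same loss as training on a set
# where that positive example appears three times (crude over-sampling).
print(weighted_log_loss(y, p, pos_weight=1.0))
print(weighted_log_loss(y, p, pos_weight=3.0))
```

Either knob — sample weights where the algorithm supports them, or resampling where it doesn't — shifts the loss toward the cases the business metric cares about.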
chengtao
|
11 years ago
|
on: Machine Learning Done Wrong
Also, thank you all for reading the post. I'm the author, and I'll be happy to clarify any of the points in the blog~
chengtao
|
11 years ago
|
on: Machine Learning Done Wrong
Not really, it was actually more inspired by Statistics Done Wrong.
chengtao
|
12 years ago
|
on: EventHub – An open source event analytics platform
Great point, and this is the exact rationale behind the architectural design.
chengtao
|
12 years ago
|
on: EventHub – An open source event analytics platform