Often this sort of material seems to be a collection of methods and explanations of how they work, which is obviously important if you want to use them. However, I usually feel the example problems are much cleaner and simpler than those I've encountered in business. There's a missing link between learning the methods and doing something that actually adds significant value for a business using machine learning. Perhaps it's just me or my field, though.
I found that usually lots of the work involved just transforming or examining data in relatively simple ways, or relying on human experts to decide on important thresholds for outliers. For example, I could run an outlier algorithm on data, and either the returned outliers were so obvious they could have been found with a manual query given the business context, or the algorithm returned a lot of false-positive outliers that were useless to the business.
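A toy illustration of that threshold trade-off (the data and cutoffs here are entirely invented): a strict z-score cutoff returns only the glaring case a manual query would have caught anyway, while a loose cutoff sweeps in ordinary variation as false positives.

```python
# Minimal z-score outlier sketch on hypothetical monthly-bill data.
from statistics import mean, stdev

def zscore_outliers(values, threshold):
    """Return (index, value) pairs whose |z-score| exceeds the threshold."""
    mu, sigma = mean(values), stdev(values)
    return [(i, v) for i, v in enumerate(values) if abs(v - mu) / sigma > threshold]

# Invented bills with two planted anomalies (90 and 95) among ~50s.
bills = [50, 52, 48, 55, 47, 90, 51, 49, 53, 46, 54, 50, 95, 48]

strict = zscore_outliers(bills, 2.0)  # only the obvious planted cases
loose = zscore_outliers(bills, 0.5)   # also flags ordinary variation
print(strict)
print(loose)
```

With the strict cutoff only the two planted values come back; loosening it pulls in perfectly normal bills, which is exactly the useless-false-positives failure mode described above.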
Other times, we'd have a predictive model that was good for 95% of cases but would make our company look ridiculous on predictions for the other 5%, so we couldn't use it in production, and the nature of the data was such that we couldn't restrict the model to only certain value ranges.
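One common mitigation, when a model produces usable confidence estimates (which, per the comment above, this data may not have allowed), is to let the model abstain on low-confidence cases and route those to a human or a fallback. A minimal sketch, with an entirely made-up stand-in model:

```python
# Hypothetical sketch: gate predictions behind a confidence threshold so
# only the "easy" majority of cases reaches production automatically.

def gated_predict(predict_with_confidence, x, min_confidence=0.9):
    """Return the model's label when it is confident, else None (abstain)."""
    label, confidence = predict_with_confidence(x)
    return label if confidence >= min_confidence else None

# Invented stand-in model: confident at the extremes, hedging in the middle.
def toy_model(usage_drop_pct):
    if usage_drop_pct > 60:
        return "churn", 0.97
    if usage_drop_pct < 10:
        return "stay", 0.95
    return "churn", 0.55  # ambiguous band: low confidence

print(gated_predict(toy_model, 80))  # confident -> "churn"
print(gated_predict(toy_model, 35))  # ambiguous -> None (send to a human)
```

This only works when the model's confidence is itself trustworthy, which is a real calibration question, not a given.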
Perhaps it was just the nature of our realm of business (telecom), and these approaches are more useful in others (advertising, stock trading, etc.). Does anyone have experience in a business field where this stuff made a sizable impact on something that was actually productionized?
Depending on the business needs, returning outliers can be useful even if there are a bunch of false positives.
I'm not a machine learning guy, but when I was at Kongregate, we had a problem with credit card fraud on our virtual goods platform. It wasn't serious fraudsters, just dipshit teens with their parents' credit card.
I had labeled data: historical transactions, with chargebacks, which I fed into Weka. I included all kinds of stuff we knew about the user. A simple rule-based classifier could pick out risky transactions, with a lot of false positives.
I made a simple tool for our customer service team to review these risky transactions. They would decide whether to warn the user, temporarily block them from buying, temporarily ban them, or permanently ban them.
This worked pretty well for us. The risk factors were new players, players spending quickly, and users who were dicks - as measured by how often others had muted them in chat, how often they swore in chat, etc.
As an aside, saying "fuck" or "shit" in chat wasn't very predictive of fraud - often those terms aren't signs of an abusive user, since they might just be saying "fuck, I suck at this game". What was predictive was users who said "Gay", "Penis", or "Rape". People who use those terms on a game platform are largely dickheads. So the score for abusiveness became known as the "Gay, Penis, Rape Score" or "GPR" for short.
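A workflow like the one described might be sketched roughly as follows. To be clear, every threshold, weight, and field name here is invented for illustration, not taken from Kongregate's actual system; the real version came out of a Weka classifier rather than hand-set rules:

```python
# Illustrative rule-based risk score with a human-review queue:
# high recall, low precision, and people make the final call.

def risk_score(txn):
    score = 0
    if txn["account_age_days"] < 7:
        score += 2                      # new players were a risk factor
    if txn["spend_last_hour"] > 50:
        score += 2                      # rapid spending was another
    score += min(txn["mute_count"], 3)  # "abusiveness" signal from chat
    return score

def review_queue(transactions, threshold=3):
    """Return transactions risky enough to show customer service."""
    return [t for t in transactions if risk_score(t) >= threshold]

txns = [
    {"id": 1, "account_age_days": 2,   "spend_last_hour": 80, "mute_count": 0},
    {"id": 2, "account_age_days": 400, "spend_last_hour": 5,  "mute_count": 1},
    {"id": 3, "account_age_days": 30,  "spend_last_hour": 60, "mute_count": 4},
]
print([t["id"] for t in review_queue(txns)])  # ids flagged for human review
```

The design point is the one made above: false positives are cheap when the classifier only feeds a review queue, because a human decision sits between the flag and the ban.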
I've had a similar experience in insurance. Our predictive algorithms have been used sparingly and guide our strategy, but we don't fully trust the actual data. That's how we leverage our analysis.
For us, small increments do give us a sizable impact. And we don't aim to predict 100% of the cases either. We take what we get and see how we can use it.
In business, we don't care about accuracy. We care about improvement.
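That distinction can be made concrete with a toy churn example (all numbers invented): if 90% of customers never churn, a do-nothing baseline that always predicts "stay" already scores 90% accuracy while finding nobody. A model with only slightly better accuracy can still be valuable because it actually surfaces churners the business can act on:

```python
# Accuracy vs. improvement on an invented 100-customer dataset.
actual = ["churn"] * 10 + ["stay"] * 90
baseline = ["stay"] * 100                 # always predict the majority class
model = ["churn"] * 7 + ["stay"] * 3 + ["stay"] * 85 + ["churn"] * 5

def accuracy(pred, truth):
    return sum(p == t for p, t in zip(pred, truth)) / len(truth)

def churners_found(pred, truth):
    return sum(p == t == "churn" for p, t in zip(pred, truth))

print(accuracy(baseline, actual), churners_found(baseline, actual))  # 0.9, 0
print(accuracy(model, actual), churners_found(model, actual))        # 0.92, 7
```

The accuracy gap (0.90 vs. 0.92) looks negligible, but the model finds 7 of 10 churners where the baseline finds none; that delta over the status quo is the improvement the comment is talking about.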
Chiming in to say that I have the exact same experience :) I work in security, and we use these methods to detect anomalies or classify malicious content or URLs. A silly false positive is embarrassing, even if it happens only once. Humans always augment our methods, or we have to set expectations with the customer that we are trading off accuracy for speed. Fast customer support usually helps against false positives too.
While I agree that data munging is very important and very difficult, I disagree that it should be part of every course teaching any kind of data manipulation.
I took a course called data mining at university and it largely consisted of munging data.
Biased by that one course, I would expect anything called "data mining" to contain a lot of practice and theory about cleaning data and a machine learning course to focus on what to do with the cleaned data.
These are just introductory courses, teaching the theory.
Teaching best practices for applying these methods to particular fields is probably beyond the expertise of any one person. Perhaps there's an opportunity for professors or practitioners of each field here?
I would argue that if you understand the physics behind the problem, then even semi-empirical models easily beat machine learning. I have seen this consistently on my datasets.
I took that course from the pre-Coursera Stanford videos, when someone from Black Rock Capital taught the course at Hacker Dojo. Did the homework in Octave, although it was intended to be done in Matlab.
It was painful. Those videos are just Ng at a physical chalkboard, with marginally legible writing. All math, little motivation, and, in particular, few graphics, although most of the concepts have a graphical representation.
Spot on. I respect the depth of Ng's knowledge, but for 99% of people, knowing how to implement a linear regression algorithm is completely useless. Hardly anyone is trying to write a better ML algorithm; the rest of us just need to import code that was written by PhDs. So it's far better to understand higher-level concepts like when you should use a certain ML method, what assumptions go into it, and generally how the underlying algorithms work.
Agreed. Though this is pretty consistent with college CS/Math courses in general (at least in my experience). A lot of dense theoretical content covered in scribbles and slides. You don't really learn anything until you just do practice problems or research the same topics independently.
The current Coursera course's videos are pretty unadorned, but he's not using a physical chalkboard any more. I also found that for most of them I can use the subtitles instead of the audio and play them back at about 2x speed.
During the time of the original class, I don't think scikit-learn and Spark were quite as mature. But perhaps Octave still enjoys a certain prominence in academic machine learning research. Matlab was also used for the recent EdX SynthBio class. And it just feels a bit archaic now, doing science in a GUI on the desktop instead of on a cloud server via the CLI ;)
It seems like, to compensate for day-to-day weight/water fluctuations, one would need to track the trailing activity and food data for a period of days prior to the data analyzed. I'm thinking 3-5.
0.2 lbs/kg lost is mostly a rounding error. Our weight can fluctuate that much on a daily basis just from the amount of salt consumed.
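The smoothing idea from the comments above can be sketched with a trailing average over a few days, so salt/water noise doesn't swamp a small real trend. The window size (5 days) follows the 3-5 day suggestion, and the weigh-in data is invented:

```python
# Trailing moving average over daily weigh-ins (hypothetical data).

def trailing_avg(values, window=5):
    """Average each point with up to (window - 1) preceding points."""
    out = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1):i + 1]
        out.append(sum(chunk) / len(chunk))
    return out

# Slow downward trend buried in day-to-day fluctuation.
weights = [180.0, 181.2, 179.5, 180.8, 179.0,
           180.1, 178.6, 179.9, 178.2, 179.0]

smoothed = trailing_avg(weights)
print(round(smoothed[-1] - smoothed[4], 2))  # negative: trend is down
```

Comparing single days here would bounce around by a pound or more, while the smoothed series drops steadily, which is the kind of signal a 0.2 lb/day change would need several days to show.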
Ng's machine learning class is excellent, but the main thing holding it back is its use of Matlab/Octave for the exercises. A Python version (with auto-grading of exercises) would be a huge improvement.
> This is not a math-heavy class, so we try and describe the methods without heavy reliance on formulas and complex mathematics. We focus on what we consider to be the important elements of modern data analysis. Computing is done in R. There are lectures devoted to R, giving tutorials from the ground up, and progressing with more detailed sessions that implement the techniques in each chapter.
I will not make solutions to homework, quizzes, exams, projects, and other assignments available to anyone else (except to the extent an assignment explicitly permits sharing solutions). This includes both solutions written by me, as well as any solutions provided by the course staff or others.
None of the material in these posts could be used directly to complete assignments for the class. I suppose someone could attempt to "back-port" some of the Python code to Octave, but if you're going to that much trouble it's probably easier to just solve it in Octave in the first place.
https://www.edx.org/course/applied-machine-learning-microsof...
https://lagunita.stanford.edu/courses/HumanitiesSciences/Sta...
The entire text is freely available online at the mentioned URL.
[0] http://www.gaussianprocess.org/gpml/