Introduction to Machine Learning for Developers

547 points| erinjerri1678 | 9 years ago |blog.algorithmia.com | reply

31 comments

[+] rav|9 years ago|reply
The description of Naive Bayes is misleading. Almost all supervised learning problems assume that "Inputs are classified in isolation where no input has an effect on any other inputs" (quote from the article), but that's not why Naive Bayes is called naive.

The naive assumption made by Naive Bayes is that the features (or attributes) of each input point are independent. Let me explain with a simple example:

Suppose you want to find people who receive benefits they are not entitled to. The input data might have two attributes: cash in the bank account, and amount received in benefits. Although you could look for data points that have a high value in both attributes, the naive assumption made in Naive Bayes says that you can in fact make your classification without correlating multiple attributes; Naive Bayes assumes you can explain the labeling of data just by looking at attributes in isolation. In this example, this assumption is clearly unfounded, since if you only look at benefits or only at cash balance, you won't be able to tell how a person should be classified.

The data independence assumption made by almost all ML algorithms is that different data points are not correlated: the label of a single data point (person in the above problem) does not depend on the attributes of other data points.
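
A minimal sketch of the point above, in plain Python. It does not use the benefits data; it assumes a hypothetical XOR-style toy dataset (the label is 1 only when exactly one attribute is "high"), chosen because it makes the failure stark. The function names `fit` and `posterior` are illustrative, not from any library.

```python
from collections import Counter, defaultdict

# Hypothetical toy data: the label depends only on the *combination* of the
# two attributes. Neither attribute alone says anything about the label.
data = [
    (("low", "low"), 0),
    (("low", "high"), 1),
    (("high", "low"), 1),
    (("high", "high"), 0),
]


def fit(rows):
    """Estimate P(label) and P(feature_i = value | label) by counting."""
    label_counts = Counter(label for _, label in rows)
    value_counts = defaultdict(Counter)  # (feature index, label) -> value counts
    for features, label in rows:
        for i, value in enumerate(features):
            value_counts[(i, label)][value] += 1
    return label_counts, value_counts


def posterior(features, label_counts, value_counts):
    """Naive Bayes posterior: the prior times a product of per-feature terms."""
    total = sum(label_counts.values())
    scores = {}
    for label, count in label_counts.items():
        score = count / total
        for i, value in enumerate(features):
            # Each attribute is looked at in isolation, given the label.
            score *= value_counts[(i, label)][value] / count
        scores[label] = score
    norm = sum(scores.values())
    return {label: score / norm for label, score in scores.items()}


label_counts, value_counts = fit(data)
for features, label in data:
    print(features, "true label:", label,
          posterior(features, label_counts, value_counts))
```

Every per-attribute probability comes out as 0.5 for both labels, so the model assigns probability 0.5 to each class for every point, even though the two attributes together determine the label exactly.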

[+] machineman44|9 years ago|reply
I agree. Most supervised learning classifiers are derived based on the independent and identically distributed assumption for each (x,y) pair.

To be more specific about the Naive Bayes assumption, the features of a data point are conditionally independent rather than simply independent. This means that, given a certain label, the features are independent of one another.
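
Written out, the conditional independence assumption looks like this. The symbols are my own notation, not the article's: x_1, ..., x_n are the features of one data point and y is its label.

```latex
P(x_1, \dots, x_n \mid y) = \prod_{i=1}^{n} P(x_i \mid y)
\qquad\Longrightarrow\qquad
\hat{y} = \arg\max_{y} \; P(y) \prod_{i=1}^{n} P(x_i \mid y)
```

The left side is the assumption itself; the right side is the classification rule that follows from it via Bayes' rule, with the denominator dropped because it does not depend on y.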

[+] princesspea|9 years ago|reply
Hi, I'm Stephanie Kim and wrote the talk/post. Thanks for the comment! Yes, you are correct: I should have specified that it is the features of each input, rather than the inputs themselves, that are regarded as independent of one another! I will revise that in the post. Again, thanks for pointing that out since it's an important distinction, especially for people just starting out!
[+] Bedon292|9 years ago|reply
I cannot recommend scikit-learn enough to anyone interested in machine learning who likes Python. I have been working with it for part of my thesis, and it can do so much with so little code. It is amazing.
[+] atsheehan|9 years ago|reply
If anyone is interested in learning more about scikit-learn, I'd recommend "Hands-On Machine Learning with Scikit-Learn and Tensorflow" from O'Reilly:

http://shop.oreilly.com/product/0636920052289.do

When I first started using scikit-learn, I was overwhelmed with the number of classes and options available. I just chose some basic classifiers I was familiar with and stuck with most of the default settings. The book explains many of the other models and when they would be useful, but also spends a lot of time exploring the datasets (using pandas), preprocessing data and building data pipelines, finding the best hyperparameters, the best ways to evaluate a model's performance, etc. The library feels less like a big bag of algorithms now and more like a cohesive data pipeline.
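
For anyone wondering what the pipeline and hyperparameter parts look like in practice, here is a rough sketch. It is not taken from the book; the dataset (iris), the model (logistic regression), and the grid of `C` values are my own choices for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Small built-in dataset, split so the final score is on unseen data.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Preprocessing and the classifier are chained into one estimator.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Cross-validated search over one hyperparameter of the "clf" step.
search = GridSearchCV(pipeline, param_grid={"clf__C": [0.1, 1.0, 10.0]}, cv=5)
search.fit(X_train, y_train)

test_accuracy = search.score(X_test, y_test)
print("best params:", search.best_params_)
print("held-out accuracy:", round(test_accuracy, 3))
```

Because the scaler sits inside the pipeline, it is refit on each cross-validation fold, so the search does not leak information from the validation folds into preprocessing.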

[+] avitzurel|9 years ago|reply
I second this.

I've been somewhat addicted to HackerRank challenges over the last couple of weeks. Why is not important, don't judge :)

The Python packages and tooling around learning and science are truly amazing. Try to do the Craigslist category classification without using Python and you'll see what I mean.

[+] peterhadlaw|9 years ago|reply
Although I did my studies with NLTK, it looks like spaCy has stepped up to the plate, particularly with NLP-related tasks.

Worth checking out, in parallel to or in place of NLTK: https://spacy.io

[+] voiceclonr|9 years ago|reply
Never heard of it. Thanks for the pointer!
[+] anton_tarasenko|9 years ago|reply
I've made a similar list for economists. It included a list of practical applications of ML. Developers can get a sense of what the discipline can do before jumping in.

APPLIED MACHINE LEARNING CASES

## Business

1. Kaggle, Data Science Use cases. An outline of business applications. Few companies have the data to implement these things. https://www.kaggle.com/wiki/DataScienceUseCases

2. Kaggle, Competitions. (Make sure you choose “All Competitions” and then “Completed”.) Each competition has a leaderboard. When users publish their solutions on GitHub, you can find links to these solutions on the leaderboard. https://www.kaggle.com/competitions

Industrial solutions are more powerful and complex than these examples, but they are not publicly available. Data-driven companies post some details about this work in their blogs.

## Emerging applications

1. Stanford’s CS229 Course, Student projects. See “Recent years’ projects.” Hundreds of short papers. http://cs229.stanford.edu/

2. CMU ML Department, Student projects. More advanced problems, compared to CS229. http://www.ml.cmu.edu/research/data-analysis-projects.html

3. arXiv, Machine Learning. Drafts of important papers appear here first. Then they get published in journals. http://arxiv.org/list/stat.ML/recent

4. CS journals. Applied ML research also appears in engineering journals. https://scholar.google.com/citations?view_op=top_venues&hl=e...

5. CS departments. For example: CMU ML Department, PhD dissertations. http://www.ml.cmu.edu/research/phd-dissertations.html

## Government

1. Bloomberg and Flowers, “NYC Analytics.” NYC Mayor’s Office of Data Analysis describes their data management system and improvements in operations. http://www.nyc.gov/html/analytics/downloads/pdf/annual_repor...

2. UK Government, Tax Agent Segmentation. https://www.gov.uk/government/uploads/system/uploads/attachm...

3. Data.gov, Applications. Some are ML-based. http://www.data.gov/applications

4. StackExchange, Applications. http://opendata.stackexchange.com/questions/3346/examples-of...

## See also

The original article: https://antontarasenko.com/2015/12/28/machine-learning-for-e...

A related list of cases: https://www.quora.com/What-are-some-practical-applications-o...

[+] lonewolf_ninja|9 years ago|reply
I recently started playing around with the data sets on past Kaggle competitions and have been learning a lot.

The Data Science use cases there are quite interesting. Are there any publicly available data-sets (other than the ones available in competitions) to work with (especially for the marketing use cases)?

[+] mi100hael|9 years ago|reply
Cool, this is a helpful intro. Anyone have any recommended reading for ML in a JVM context?
[+] otoburb|9 years ago|reply
The Skymind.io co-founders wrote a book that references their open-source "deep learning for Java and Scala framework"[1]. They are in the YC16 batch.

[1] https://deeplearning4j.org/about

[+] pineapple_sauce|9 years ago|reply
In the slides for unsupervised learning, what is meant by "Maximum Entropy"? Doesn't this just imply that the distribution will be uniform; i.e. it's no better than making a blind guess?
[+] machineman44|9 years ago|reply
I have only seen a maximum entropy model as part of the supervised realm, where it is a discriminative model. In other words, given some labeled data, we can draw a decision boundary. Maximum entropy in this context is almost certainly associated with the information theory definition, where the entropy of a collection of data is measured based on the distribution of classes. Entropy is high if each class is equally probable and lower otherwise.
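
To make the entropy definition concrete, here is a small sketch using the standard Shannon entropy in bits. The function name `entropy` and the example distributions are mine, not from the slides.

```python
import math


def entropy(probabilities):
    """Shannon entropy, in bits, of a discrete class distribution."""
    # Terms with p == 0 contribute nothing and are skipped to avoid log(0).
    return sum(-p * math.log2(p) for p in probabilities if p > 0)


uniform = entropy([0.5, 0.5])   # both classes equally probable
skewed = entropy([0.9, 0.1])    # one class dominates
certain = entropy([1.0, 0.0])   # only one class ever occurs

print(uniform, skewed, certain)
```

The uniform two-class case gives exactly 1 bit, the skewed case gives less, and the certain case gives 0. On the original question: a maximum entropy model is normally fit subject to constraints derived from the data, so it only falls back to the uniform distribution when there are no constraints at all.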
[+] machineman44|9 years ago|reply
Honestly, this is a good run-through of resources and examples of different machine learning algorithms/techniques, be it supervised, unsupervised, or model validation... however, the wording used and the mistakes made when describing supervised learning or Naive Bayes show that this is an attempt at taking an O'Reilly book and trying to summarize it in a short article... while making errors... How did it get so many points on ycombinator?
[+] princesspea|9 years ago|reply
Hi! I'm Stephanie Kim and wrote the article. This post and slides were from a talk I gave for a basic introduction to machine learning at a women's programming conference in Seattle. I did update the language, which was a mistake rather than a misunderstanding of Naive Bayes. I have professional machine learning experience, and while I am definitely not an expert, the talk was geared toward web developers with no prior experience in machine learning. Thanks for your feedback.
[+] highCs|9 years ago|reply
If one understands all of that decently, does one get a job?
[+] DrNuke|9 years ago|reply
Supply-demand dynamics at play. The general answer is no, though. Chances are higher if you use these in your own domain and become an applied expert, or if you win some competition on Kaggle or a similarly tough environment.
[+] lisivka|9 years ago|reply
If you are able to debug all that, you will get a job.
[+] hota_mazi|9 years ago|reply
Not even a mention of TensorFlow or Torch?