Introduction to Machine Learning for Developers

[+] rav|9 years ago|reply

The description of Naive Bayes is misleading. Almost all supervised learning problems assume that "Inputs are classified in isolation where no input has an effect on any other inputs" (quote from the article), but that's not why Naive Bayes is called naive.

The naive assumption made by Naive Bayes is that the features (or attributes) of each input point are independent. Let me explain with a simple example:

Suppose you want to find people who receive benefits they are not entitled to. The input data might have two attributes: cash on bank account, and amount received in benefits. Although you could look for data that have a high value in both attributes, the naive assumption made in Naive Bayes says that you can in fact make your classification without correlating multiple attributes; Naive Bayes assumes you can explain the labeling of data just by looking at attributes in isolation. In this example, this assumption is clearly unfounded, since if you only look at benefits or only at cash balance, you won't be able to tell how a person should be classified.

The data independence assumption made by almost all ML algorithms is that different data points are not correlated: the label of a single data point (person in the above problem) does not depend on the attributes of other data points.
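To make the feature-independence point concrete, here is a minimal sketch of a count-based Naive Bayes in plain Python. The dataset is contrived and is my own assumption, not the benefits data above: two binary features, with the label set to 1 only when exactly one of them is on (XOR). The helper names `fit` and `posterior` are hypothetical.

```python
from collections import Counter, defaultdict

# Contrived data (an assumption for illustration): the label is 1 only when
# exactly one of the two binary features is set, i.e. XOR.
data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)] * 25


def fit(samples):
    """Estimate P(y) and P(x_i = v | y) by counting, with no smoothing."""
    label_counts = Counter(y for _, y in samples)
    feature_counts = defaultdict(Counter)  # (feature index, label) -> value counts
    for x, y in samples:
        for i, v in enumerate(x):
            feature_counts[(i, y)][v] += 1
    priors = {y: n / len(samples) for y, n in label_counts.items()}
    likelihoods = {
        key: {v: n / label_counts[key[1]] for v, n in counts.items()}
        for key, counts in feature_counts.items()
    }
    return priors, likelihoods


def posterior(x, priors, likelihoods):
    """Return normalized P(y | x) under the naive factorization."""
    scores = {}
    for y, prior in priors.items():
        score = prior
        for i, v in enumerate(x):
            # Each feature contributes on its own; no feature pair is examined.
            score *= likelihoods[(i, y)].get(v, 0.0)
        scores[y] = score
    total = sum(scores.values())
    return {y: s / total for y, s in scores.items()}


priors, likelihoods = fit(data)
for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, posterior(x, priors, likelihoods))
```

Each feature on its own is split 50/50 within both classes, so every posterior comes out at 0.5. The model does no better than a coin flip, even though the label is a deterministic function of the two features taken together.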

[+] machineman44|9 years ago|reply

I agree. Most supervised learning classifiers are derived based on the independent and identically distributed assumption for each (x,y) pair.

To be more specific about the Naive Bayes assumption, the features of a data point are conditionally independent rather than simply independent. This means that, given a certain label, the features in this set are independent of one another.
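Written out, with notation that is my own assumption (features x_1, ..., x_n and label y), the conditional independence assumption and the resulting decision rule look roughly like this:

```latex
% Conditional independence of the features given the label
P(x_1, \dots, x_n \mid y) = \prod_{i=1}^{n} P(x_i \mid y)

% Resulting Naive Bayes decision rule
\hat{y} = \arg\max_{y} \; P(y) \prod_{i=1}^{n} P(x_i \mid y)
```

Note that this does not require the features to be independent once the label is marginalized out: in general P(x_1, ..., x_n) is not equal to the product of the P(x_i).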

[+] princesspea|9 years ago|reply

Hi, I'm Stephanie Kim and I wrote the talk/post. Thanks for the comment! Yes, you are correct: I should have specified that it is the features of each input, rather than the inputs, that are regarded as independent of one another. I will revise that in the post. Again, thanks for pointing that out, since it's an important distinction, especially for people just starting out!

[+] Bedon292|9 years ago|reply

I cannot recommend scikit-learn enough to anyone interested in machine learning who likes Python. I have been working with it for part of my thesis, and it can do so much with so little code. It is amazing.

[+] atsheehan|9 years ago|reply

If anyone is interested in learning more about scikit-learn, I'd recommend "Hands-On Machine Learning with Scikit-Learn and TensorFlow" from O'Reilly:

http://shop.oreilly.com/product/0636920052289.do

When I first started using scikit-learn, I was overwhelmed by the number of classes and options available. I just chose some basic classifiers I was familiar with and stuck with most of the default settings. The book explains many of the other models and when they would be useful, but also spends a lot of time on exploring the datasets (using pandas), preprocessing data and building data pipelines, finding the best hyperparameters, the best ways to evaluate a model's performance, etc. The library feels less like a big bag of algorithms now and more like a cohesive data pipeline.

[+] avitzurel|9 years ago|reply

I second this.

I've been somewhat addicted to HackerRank challenges over the last couple of weeks. Why is not important, don't judge :)

The Python packages and tooling around learning and science are truly amazing. Try doing the Craigslist category classification without using Python and you'll see what I mean.

[+] adamnemecek|9 years ago|reply

For anyone trying to get into the field, I put together a list of resources I found useful:

https://news.ycombinator.com/item?id=12900448

[+] peterhadlaw|9 years ago|reply

Although I did my studies with NLTK, it looks like spaCy has stepped up to the plate, particularly with NLP-related tasks.

Worth checking out, in parallel to or in place of NLTK: https://spacy.io

[+] voiceclonr|9 years ago|reply

Never heard of it. Thanks for the pointer!

[+] pknerd|9 years ago|reply

For Python programmers, Harrison's website is an awesome resource:

http://pythonprogramming.net

[+] anton_tarasenko|9 years ago|reply

I've made a similar list for economists. It included a list of practical applications of ML. Developers can get a sense of what the discipline can do before jumping in.

APPLIED MACHINE LEARNING CASES

## Business

1. Kaggle, Data Science Use cases. An outline of business applications. Few companies have the data to implement these things. https://www.kaggle.com/wiki/DataScienceUseCases

2. Kaggle, Competitions. (Make sure you choose “All Competitions” and then “Completed”.) Each competition has a leaderboard. When users publish their solutions on GitHub, you can find links to these solutions on the leaderboard. https://www.kaggle.com/competitions

Industrial solutions are more powerful and complex than these examples, but they are not publicly available. Data-driven companies post some details about this work in their blogs.

## Emerging applications

1. Stanford’s CS229 Course, Student projects. See “Recent years’ projects.” Hundreds of short papers. http://cs229.stanford.edu/

2. CMU ML Department, Student projects. More advanced problems, compared to CS229. http://www.ml.cmu.edu/research/data-analysis-projects.html

3. arXiv, Machine Learning. Drafts of important papers appear here first. Then they get published in journals. http://arxiv.org/list/stat.ML/recent

4. CS journals. Applied ML research also appears in engineering journals. https://scholar.google.com/citations?view_op=top_venues&hl=e...

5. CS departments. For example: CMU ML Department, PhD dissertations. http://www.ml.cmu.edu/research/phd-dissertations.html

## Government

1. Bloomberg and Flowers, “NYC Analytics.” NYC Mayor’s Office of Data Analysis describes their data management system and improvements in operations. http://www.nyc.gov/html/analytics/downloads/pdf/annual_repor...

2. UK Government, Tax Agent Segmentation. https://www.gov.uk/government/uploads/system/uploads/attachm...

3. Data.gov, Applications. Some are ML-based. http://www.data.gov/applications

4. StackExchange, Applications. http://opendata.stackexchange.com/questions/3346/examples-of...

[+] lonewolf_ninja|9 years ago|reply

I recently started playing around with the data sets on past Kaggle competitions and have been learning a lot.

The Data Science use cases there are quite interesting. Are there any publicly available data-sets (other than the ones available in competitions) to work with (especially for the marketing use cases)?

[+] mi100hael|9 years ago|reply

Cool, this is a helpful intro. Anyone have any recommended reading for ML in a JVM context?

[+] esfandia|9 years ago|reply

Weka is a popular machine learning toolkit in Java: http://www.cs.waikato.ac.nz/ml/weka/

and they have a textbook to go with it: http://www.cs.waikato.ac.nz/ml/weka/book.html

as well as an online course: https://weka.waikato.ac.nz/dataminingwithweka/preview

[+] agibsonccc|9 years ago|reply

Hi, skymind cofounder here. To offer some context on our book there: We have appendixes covering some of the fundamental concepts such as linear algebra and statistics.

For other machine learning libraries in Java:

https://github.com/haifengl/smile

http://knime.org

http://rapidminer.com

[+] marcinzm|9 years ago|reply

If you're interested in the big data side of things, there's Spark (http://spark.apache.org/) and MLlib for it (http://spark.apache.org/docs/latest/ml-guide.html). H2O (http://www.h2o.ai/) also provides ML algorithms on top of Spark (and I think independent of Spark as well, not sure of the current status). These all run on the JVM and are written either in Scala (Spark) or Java (H2O).

[+] otoburb|9 years ago|reply

The Skymind.io co-founders wrote a book that references their open-source "deep learning for Java and Scala framework"[1]. They are in the YC16 batch.

[1] https://deeplearning4j.org/about

[+] adamnemecek|9 years ago|reply

I haven't read it but this book looks reasonable https://www.amazon.com/Scala-Machine-Learning-Patrick-Nicola...

[+] pineapple_sauce|9 years ago|reply

In the slides for unsupervised learning, what is meant by "Maximum Entropy"? Doesn't this just imply that the distribution will be uniform; i.e. it's no better than making a blind guess?

[+] machineman44|9 years ago|reply

I have only seen a maximum entropy model as part of the supervised realm, where it is a discriminative model. In other words, given some labeled data, we can draw a decision boundary. Maximum entropy in this context is almost certainly associated with the information theory definition, where the entropy of a collection of data is measured based on its distribution of classes. Entropy is high if each class is equally probable and lower otherwise.
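As a rough illustration of that information theory definition, here is a small sketch that computes the Shannon entropy of a list of class labels. The function name `entropy` and the spam/ham label lists are made up for the example.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of the empirical class distribution."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


# Hypothetical label collections: one balanced, one heavily skewed.
balanced = ["spam"] * 50 + ["ham"] * 50
skewed = ["spam"] * 90 + ["ham"] * 10

print(entropy(balanced))  # 1.0 bit: both classes equally probable
print(entropy(skewed))    # lower, roughly 0.47 bits
```

For two classes, 1 bit is the maximum, reached only when both classes are equally likely. Any imbalance pushes the value toward 0.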

[+] machineman44|9 years ago|reply

Honestly, this is a good run-through of resources and examples of different machine learning algorithms/techniques, be it supervised, unsupervised, or model validation... however, the wording used and the mistakes made when describing supervised learning and Naive Bayes show that this is an attempt at taking an O'Reilly book and trying to summarize it in a short article... while making errors... How did it get so many points on ycombinator?

[+] princesspea|9 years ago|reply

Hi! I'm Stephanie Kim and I wrote the article. This post and the slides were from a talk I gave as a basic introduction to machine learning at a women's programming conference in Seattle. I did update the language, which was a mistake rather than a misunderstanding of Naive Bayes. I have professional machine learning experience, and while I am definitely not an expert, the talk was geared toward web developers with no prior experience in machine learning. Thanks for your feedback.

[+] highCs|9 years ago|reply

If one understands all of that decently, does one get a job?

[+] DrNuke|9 years ago|reply

Supply-demand dynamics at play. The general answer is no, though. Chances are higher if you use these in your own domain and become an applied expert, or if you win some competition on Kaggle or a similarly tough environment.

[+] lisivka|9 years ago|reply

If you are able to debug all that, you will get a job.

[+] hota_mazi|9 years ago|reply

Not even a mention of TensorFlow or Torch?

31 comments