Data Science from Scratch: First Principles with Python

237 points | joelgrus | 11 years ago | joelgrus.com | reply

41 comments

[+] danso|11 years ago|reply
I was going to ask, "Why Python 2.x?" But then I just bought the book. Hope you don't mind if I post this excerpt:

> As I write this, the latest version of Python is 3.4. At DataSciencester, however, we use old, reliable Python 2.7. Python 3 is not backward-compatible with Python 2, and many important libraries only work well with 2.7. The data science community is still firmly stuck on 2.7, which means we will be, too. Make sure to get that version.

I use the more popular scientific libraries, e.g. numpy, scikit, nltk, and the bigger ones seem to have been ported over to 3.x. A few libs that haven't been ported come to mind: mechanize and opencv. Has anyone here had success using 3.x as a data science professional, or is there some massive gaping hole that I'm missing? (I agree that "Well, this is what the company has been using" is a decent enough excuse to stay on 2.x in most situations.)
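
For readers wondering what "not backward-compatible" looks like in practice, here is a small sketch of a few well-known behavior changes between Python 2.7 and Python 3. It is illustrative only and not taken from the book; the variable names are made up for the example, and the code is written to run under Python 3.

```python
# Runs under Python 3; comments note how Python 2.7 behaves differently.

# 1. Division: "/" on two ints is true division in Python 3.
half = 7 / 2         # 3.5 in Python 3; 3 in Python 2.7
floor_half = 7 // 2  # 3 in both versions

# 2. Text vs. bytes: str is Unicode text in Python 3, bytes is a separate type.
text = "na\u00efve"             # five characters, one of them non-ASCII
encoded = text.encode("utf-8")  # a bytes object, one byte longer than the text

# 3. Lazy built-ins: map/zip/range return iterators or views, not lists.
squares = map(lambda x: x * x, range(4))
squares_list = list(squares)    # materialize explicitly when a list is needed

# 4. print is a function; the statement form `print half` is a SyntaxError.
print(half, floor_half, len(text), len(encoded), squares_list)
```

Differences like these are why a library written for 2.7 usually needs an explicit port rather than running unchanged on 3.x.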

[+] rdtsc|11 years ago|reply
Even projects that claim to have been ported will often have bugs in them, because the port is new code. Then it is a question of whether I have the time, or the desire, to test the port on my production system. I just look at the issue tracker or commit stream and see when the issues related to Python 3 start to slow down a bit.
[+] Omnipresent|11 years ago|reply
I just finished the ML class from Georgia Tech as part of the OMSCS program. I used scikit-learn for most of the assignments, as they involved NNs, DTs, KNN, k-means, and EM. This might be a naive question as I'm not a Python guy, but is there a reason this book is Python-based but doesn't cover scikit-learn? For example, what need did you see to write your own code for k-means [1] rather than use an implementation that is already available [2]?

[1] https://github.com/joelgrus/data-science-from-scratch/blob/m...

[2] http://scikit-learn.org/stable/modules/generated/sklearn.clu...

[+] treycausey|11 years ago|reply
The title of the book includes "from scratch" for a reason -- it's from "first principles" where you learn about something by building it up from scratch rather than using an implementation. At the end of each chapter, Joel points out the existing resources you can use after learning about the topic.
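
To make the "from scratch" idea concrete, here is a minimal k-means sketch using only the standard library. It is an illustration under stated assumptions, not the book's code: the function names (`squared_distance`, `vector_mean`, `k_means`), the random initialization from the data points, and the toy data are all hypothetical choices made for this example.

```python
import random
from typing import List, Sequence, Tuple

Vector = Sequence[float]

def squared_distance(v: Vector, w: Vector) -> float:
    """Sum of squared coordinate differences between v and w."""
    return sum((a - b) ** 2 for a, b in zip(v, w))

def vector_mean(vectors: Sequence[Vector]) -> List[float]:
    """Component-wise mean of a non-empty collection of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def k_means(points: Sequence[Vector], k: int,
            max_iter: int = 100, seed: int = 0) -> Tuple[List[List[float]], List[int]]:
    """Alternate between assigning points to the nearest mean and recomputing means."""
    rng = random.Random(seed)
    # Assumption: start from k distinct data points chosen at random.
    means = [list(p) for p in rng.sample(list(points), k)]
    assignments: List[int] = []
    for _ in range(max_iter):
        # Assignment step: index of the closest mean for every point.
        new_assignments = [
            min(range(k), key=lambda i: squared_distance(p, means[i]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # nothing changed, so the means will not change either
        assignments = new_assignments
        # Update step: move each mean to the centroid of its cluster.
        for i in range(k):
            cluster = [p for p, a in zip(points, assignments) if a == i]
            if cluster:  # leave the mean alone if its cluster is empty
                means[i] = vector_mean(cluster)
    return means, assignments

# Toy data: two well-separated groups of three points each.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
means, assignments = k_means(points, k=2)
print(means, assignments)
```

Writing the two steps out by hand is the point of the exercise; for real work, a maintained implementation such as the scikit-learn one linked above is the more sensible choice.
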
[+] jplahn|11 years ago|reply
Looks great, Joel! Definitely going to check this out and start working through it. I've noticed a huge split between extremely applied data science material and almost entirely math-based material. I was always wary of 'learning' data science through applications only, but as you alluded to, it's significantly more exciting. Likewise, most introductory statistics classes are so poorly delivered that many people have a deeply ingrained fear of the underlying concepts.

As a side note, do you attend any data events in Seattle? I'm moving there in June after graduation and would love to talk with somebody doing my dream job.

[+] joelgrus|11 years ago|reply
I attend a lot of data events in Seattle. Especially Data Science Happy Hours.
[+] sputknick|11 years ago|reply
Any chance there is a discount code to encourage early readers?
[+] joelgrus|11 years ago|reply
AUTHD

(And I didn't know that until you asked; I'm going to edit the blog post.)

[+] barely_stubbell|11 years ago|reply
Does anyone have recommendations for books in the math/data/statistics space that might pair well with this one? Thought I might pick up a few books and score some free shipping.
[+] thehoff|11 years ago|reply
Looks interesting, I'll probably pick this up.

Have you posted to DataTau?

[+] jsnk|11 years ago|reply
This book looks very close to what my girlfriend is looking for. She's interested in learning bioinformatics and it's been difficult to find a good book that introduces topics in data science in a digestible manner.

If anyone knows the book, can you give a quick overview of how much math, stats, programming, and comp sci you'd need to read it? Thank you.

[+] joelgrus|11 years ago|reply
I know the book, I wrote it!

Most of the math is vector space arithmetic. There are a few sections that use matrix multiplication. The probability and stats are things like understanding probability distributions and Bayes's Theorem. (It's all covered in the book, but you'd need to be comfortable picking it up and using it.)

In terms of programming, not much. Someone who's never programmed before would probably have a tough time, but the goal is that someone who is bright and hardworking and who can write fairly simple Python programs should not have a problem. Very little CS background required. Maybe basic data structures like list vs dict and so on.
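
As a rough gauge of that level, here is a short sketch of the kind of vector arithmetic and Bayes's Theorem calculation described above. It is not taken from the book; the function names and the numbers in the example (a 1% prior, a 99% true-positive rate, a 1% false-positive rate) are assumptions made purely for illustration.

```python
from typing import List

Vector = List[float]

def vector_add(v: Vector, w: Vector) -> Vector:
    """Add two vectors component by component."""
    return [a + b for a, b in zip(v, w)]

def scalar_multiply(c: float, v: Vector) -> Vector:
    """Multiply every component of v by the scalar c."""
    return [c * a for a in v]

def dot(v: Vector, w: Vector) -> float:
    """Sum of component-wise products."""
    return sum(a * b for a, b in zip(v, w))

def bayes_posterior(prior: float, true_positive_rate: float,
                    false_positive_rate: float) -> float:
    """P(H | E) = P(E | H) P(H) / [P(E | H) P(H) + P(E | not H) P(not H)]."""
    evidence = (true_positive_rate * prior
                + false_positive_rate * (1 - prior))
    return true_positive_rate * prior / evidence

print(vector_add([1, 2], [3, 4]))
print(scalar_multiply(2, [1, 2, 3]))
print(dot([1, 2, 3], [4, 5, 6]))
# A condition with a 1% base rate and a test that is right 99% of the time:
# a positive result leaves the posterior at roughly one half, not 99%.
print(bayes_posterior(prior=0.01, true_positive_rate=0.99, false_positive_rate=0.01))
```

Someone comfortable reading and modifying code like this should be within the range of preparation described here.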

[+] a_bonobo|11 years ago|reply
She may also enjoy Vince Buffalo's Bioinformatics Data Skills

http://shop.oreilly.com/product/0636920030157.do

It's more focused on how to analyze existing biological data with the shell and R, and on how to use git.

Personally, I've rarely seen advanced machine learning being used outside of genome-wide association studies, and even there most people just use PLINK's logistic regression without understanding what's being done and call it a day.

Another really good book on how to understand statistics is Motulsky's Intuitive Biostatistics - it introduces all common "tests" and methodologies people working in the life sciences use, but without the formulas (you use R for that anyway). It's more about the caveats of each test, in which situation you'd use it, what can go wrong, how to interpret the results etc., all written in a very lively style.

[+] mdesq|11 years ago|reply
Thanks Joel. I just purchased both the print and ebook copies from O'Reilly. This is exactly what I've been looking for.
[+] kunjaan2|11 years ago|reply
Could you please post this on /r/machinelearning as well with an AMA? Thanks.
[+] blumkvist|11 years ago|reply
Good that it discusses overfitting.

I can't help but wonder about those recommender systems. With so little material on statistics I have to assume it's only about observational data, which is the best way to make the millionth+1 useless recommendation engine.

And why is it that the "data science" books never discuss DoE?

[+] x0x0|11 years ago|reply
Most data scientists, particularly those coming from the CS department, lack most probability and statistics fundamentals. I doubt many of them have even heard of ANOVA, sampling distributions, F tests, chi-squared tests, etc.

To be fair, ML tends to focus very heavily on prediction, not inference / interpretation of betas. In many tree models, how to even understand coefs is an open question.

[+] Goladus|11 years ago|reply
What is DoE? I hear DoE and I think "Department of Education" or "Department of Energy" which is also what comes up in a web search...
[+] ced|11 years ago|reply
Do you have any reading recommendations about DoE?
[+] memilanuk|11 years ago|reply
I've been wondering much the same thing: why DoE gets no love, while everyone is crazy for ML.