I'm the author, and I'm happy to answer any questions.
Great book, I'm getting a lot out of the site and I'm looking forward to the release. Thanks!
Trivial, self-serving question: is there a library for generating the diagram of table relationships here (13.2 nycflights13)? http://r4ds.had.co.nz/relational-data.html
Hey Hadley. Huge fan of your work! Many of the libraries you have authored or co-authored have had a big influence on how I think about building tools. I look forward to getting a hard copy of the book!
I'm a software engineer who is already quite comfortable with Python and has more of an interest in machine learning than data science (as I understand it). Is there any reason for me to learn R?
I've always had a hard time understanding pipes, so it may just be that the concept will take more time for me to grok. I thought that the little bunny foo foo example in section 18.2 was really hard to grasp. I think that doing a similarly in-depth example with numerical data, while more boring, may make the concept easier to understand.
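For what it's worth, here is a small numerical sketch of the idea, written in Python rather than R. The `pipe` helper, `square_all`, and `mean` are hypothetical names made up for this illustration (they are not part of the book or of any library). The sketch only assumes that a pipe passes the result of one step along as the input of the next, which is how `%>%` is used in the book.

```python
from functools import reduce
import math


def pipe(value, *functions):
    """Feed `value` through each function in turn, left to right."""
    return reduce(lambda acc, fn: fn(acc), functions, value)


def square_all(xs):
    """Square every element of a list."""
    return [x * x for x in xs]


def mean(xs):
    """Arithmetic mean of a non-empty list."""
    return sum(xs) / len(xs)


data = [1.0, 2.0, 3.0, 4.0]

# Nested calls have to be read from the inside out.
nested = math.sqrt(mean(square_all(data)))

# The piped version lists the same steps in the order they run.
piped = pipe(data, square_all, mean, math.sqrt)
```

Both forms compute the root mean square of `data`; the piped form just states the steps in the order they happen.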
R for Data Science is the canonical source for learning R and other real-world R tools such as dplyr/tidyr/ggplot2, and one I've recommended on HN submissions about R tutorials which simply go over primitive data types and out-of-date packages. (It's one of the reasons I've postponed making R tutorials myself, since the book would be better/more accurate in all circumstances.)
I'm taking a research-track (PhD-access) master's degree, and all I can tell you is that R is one of the most popular tools. If it is used at your university, go for it.
Personally I work in macro and fixed income market analysis (strategist), and I can heartily recommend R as your first language. Indeed, coming from a CS background, I first applied Python to many problems, and resisted R, which was not a "grown up" programming language, in my opinion (some would make the same accusation of Python). However, I dipped my toe in the water one day because R had a Bloomberg terminal add-in and Python did not (at the time), and after about a month of uphill learning curve the eureka moments started materializing thick and fast. I cannot recommend R enough as a problem exploration language. It just beats Python hands down when it comes to grabbing some (usually dirty) data, mangling it around, cleaning it, and then install.packages'ing a bunch of potentially useful libraries which allow you to do everything you could possibly imagine to a small to medium sized data set. And crucially, static graphing. Nothing else comes close for this use case.
I'm learning R for fun at the moment. I'm sure it's super useful for statisticians but it's quite an intricate language! It's an unlikely mix of different paradigms and features mixed together. Not something I'd recommend to a beginner programmer, yet it seems that people love it (even non-programmers).
You might enjoy <http://adv-r.had.co.nz/>, which discusses R from more of a programming language perspective (albeit a programming language that is chiefly used for data analysis). There are a lot of misunderstandings about R the language.
hadley, I love your book, and I learn a lot from your preferences in R packages. Now is there a general source for determining the "best" packages for various tasks?
Hadley, can you share a bit more about your plans for modelr and what need(s) the package will be designed to solve? Congrats on your book btw, I've been reading it for a few weeks and it's quite simply excellent.
I don't think modelr is going to change significantly in the future. It solved a pressing problem (fitting models as part of a pipeline) so I could teach modelling using the same interface as everything else in the book.
I LOVE Python and really am pleased with Pandas, but ...
There's a large number of such books, though none that are as authoritative with respect to Python (this is a statement about the size of Python's community vs. R, not necessarily about the authors):
[+] [-] hadley|9 years ago|reply
The book should be in print by (hopefully) the end of this year, or definitely by Jan 2017. The content will not change significantly, but there will be minor fixes and a lot of proofreading.
[+] [-] geebee|9 years ago|reply
I understand there is always one more library or topic that could be included...
.. but with that acknowledged, what do you think of sqldf as an alternative to dplyr? You mention that dplyr is a bit easier (within the context of being specialized for data analysis). I'd have trouble weighing in because I don't use R all that much, but I do really like the Python "equivalent" pandasql.
Also, I've used SQL for a long time, so I'd have trouble at this point really knowing what's "easier" for someone new to both, but I do often find it easier to use SQL than do data frame operations in pandas. dplyr seems to be a closer cousin to standard SQL, so the difference might not be quite as great.
[+] [-] danso|9 years ago|reply
And of course, thanks for another great book, it's helpful for learning R but I'm always enlightened by how thoroughly you explain the general concepts (e.g. Relational data and joins). Have heard a few people on faculty speak enthusiastically about the book even as I hold out for more adoption of Python :)
[+] [-] siddboots|9 years ago|reply
I have a bit of a nitpick about chapter 13 on "relational data", in which I believe you are consistently misusing the technical term "relation" to refer to the relationship between two data sets. In the context of relational database theory, "relation" is just another word for "table" (although it connotes more mathematical formalism).
I think it is worth respecting the precise technical usage in this case: Consider a student who might read your book and be told that "relations are always defined between a pair of tables," and that "a primary key and the corresponding foreign key in another table form a relation." The same student might also stumble across the Wikipedia page for the relational model and learn that "a relation is defined as a set of tuples that have the same attributes," and that the relational model "organizes data into one or more tables (or relations)."
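To make the terminology concrete, here is a minimal Python sketch of the database-theory usage, where each table is itself a relation (a set of tuples with the same attributes) and the key only links two relations. The `Airline` and `Flight` types, the sample rows, and the `join_on_carrier` helper are all made up for illustration; only the idea of a shared `carrier` key is borrowed from the nycflights13 tables discussed in the chapter.

```python
from typing import NamedTuple


class Airline(NamedTuple):
    carrier: str  # primary key
    name: str


class Flight(NamedTuple):
    flight: int
    carrier: str  # foreign key referring to Airline.carrier


# Each "table" is a relation: a set of tuples sharing the same attributes.
# These rows are illustrative, not real data.
airlines = {
    Airline("AA", "American Airlines"),
    Airline("UA", "United Air Lines"),
}
flights = {Flight(1, "AA"), Flight(2, "UA"), Flight(3, "AA")}


def join_on_carrier(flights, airlines):
    """Join the two relations on the carrier key.

    The result is again a relation (a set of tuples), which is why
    relational operations compose.
    """
    names = {a.carrier: a.name for a in airlines}
    return {
        (f.flight, f.carrier, names[f.carrier])
        for f in flights
        if f.carrier in names
    }


joined = join_on_carrier(flights, airlines)
```

In this vocabulary, `airlines`, `flights`, and `joined` are all relations, while the primary key/foreign key pair is what connects `flights` to `airlines`.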
[+] [-] Eridrus|9 years ago|reply
[+] [-] glogla|9 years ago|reply
A lot of the exercises, especially in the exploratory data analysis part, are "why is blah?" or "is there a relationship in blah?"
I think I know the answers, but it would be nice to be able to check if I see what I'm supposed to see in the data.
[+] [-] nthot|9 years ago|reply
I've been using r4ds for the past couple weeks. This is the first time I've really understood how everything in R and the tidyverse fits together. I am really enjoying the book. It has already helped me immensely.
Seriously, thank you for writing this.
[+] [-] facorreia|9 years ago|reply
[+] [-] nickdavidhaynes|9 years ago|reply
[+] [-] minimaxir|9 years ago|reply
[+] [-] Kevin_S|9 years ago|reply
Some background on myself first. I am a financial consultant (only 1 year since graduating) and am planning to do a PhD in Accounting in the next 3 years. Currently working through the GMAT, but once that is complete, I will find myself with 2 or so years to do things that will help prepare me for research. One thing I have considered is taking a course/reading books on data science and such to prepare me for the advanced stats/data analysis that will go on during research. As someone with no coding experience, and with a solid quant background (I was an economics undergrad), would this book be a good starting point for getting experience with this stuff? And is R the appropriate language to learn? I don't mind learning to code, but it is intimidating.
Thanks!
[+] [-] javitury|9 years ago|reply
It's not just the features of the R programming language. It is also free, open source and can be integrated easily with many external tools. For example, you can write a paper using Markdown (RMarkdown) and convert it to ->TEX->PDF. Or you can use RMarkdown to make a presentation, converting it to ->TEX->beamer slides. Or just output an HTML document/webpage. It is also well integrated with online databases and resources.
While it's true that it can be a bit less intuitive or not as visualization focused as compared to some other tools, it is just as powerful under the hood.
Other open source tools like Python are good for programming, but when it comes to data analysis they can be a little bit cumbersome. For example, Python is quite object oriented, and for straightforward purposes with low reusability the amount of code needed can be large.
There are other tools like Matlab, Mathematica, SPSS... Matlab is good at visualization and Mathematica has really nice features aimed at improving understanding. However, these are closed source and cost you or the university a lot of money.
[+] [-] vegabook|9 years ago|reply
Now...caveats. R is not a production programming language. If you find yourself creating something truly useful for many users, that requires robust programming language structures such as threading, proper memory management, server-capability, or indeed, speed, R is going to become frustrating. Yes, a whole bunch of people will tell you "it's possible, I do it, etc", but that is not its sweet spot. Also, if your data set is bigger than 2-3 gig or so, you're going to start hitting R's memory management wall. It's slow. You'll then be better off with Python, C, or indeed, Scala, or possibly, Apache Spark. The common thing about these caveats, however, is that they're definitely second-order problems, later in your career life cycle, than the excellent mainstream data science tool which is R for people who have outgrown Excel, but are not full-fledged computer scientists, and who want to get (lots of) stuff done.
(by the way, pre-empting comments. Yes Pandas is great, but no it's not quite R).
[+] [-] sndean|9 years ago|reply
Though they suggest Garrett's book [1] as a companion to R for Data Science in the Prerequisites section [2].
> And is R the appropriate language to learn?
I'd think so. Possibly either R or Python, or both (if you want to get beyond Excel).
[1] https://www.amazon.com/dp/1449359019 [2] http://r4ds.had.co.nz/intro.html
[+] [-] hadley|9 years ago|reply
[+] [-] zzleeper|9 years ago|reply
Also, RMarkdown looks incredibly well thought out
[+] [-] hadley|9 years ago|reply
[+] [-] yodsanklai|9 years ago|reply
I looked at several tutorials and what worked for me the best so far are the official manuals https://cran.r-project.org/manuals.html (esp. the language definition and the "introduction to R").
Moreover, for programming language enthusiasts, the following article is pretty interesting:
Evaluating the Design of the R Language (Morandat, Hill, Osvald, Vitek).
"R is a dynamic language for statistical computing that combines lazy functional features and object-oriented programming. This rather unlikely linguistic cocktail would probably never have been prepared by computer scientists, yet the language has become surprisingly popular. With millions of lines of R code available in repositories, we have an opportunity to evaluate the fundamental choices underlying the R language design. Using a combination of static and dynamic program analysis we can assess the impact and success of different language features."
[+] [-] dfsegoat|9 years ago|reply
I will say, having struggled to use R and finally prevailing, that I will NEVER use anything else for generating figures for academic publications or internal reports (ggplot2 - same as the author of course).
Nothing comes close to the composable, functional way that it "just works" -- I don't use it every day - but it is my "go to" for data exploration etc. -- more than pandas / ipython.
[+] [-] hadley|9 years ago|reply
[+] [-] benhamner|9 years ago|reply
We also have almost 10,000 forkable & executable R examples on Kaggle (https://www.kaggle.com/kernels - select R from languages). Almost all of these use at least one of Hadley's libraries.
[+] [-] mooneater|9 years ago|reply
CRAN has task views, but they are long lists and don't clearly show popularity or feature matrices. There are just so many options.
I'm thinking something like https://djangopackages.org , for example see https://djangopackages.org/grids/g/commenting/
[+] [-] hadley|9 years ago|reply
[+] [-] dreww2|9 years ago|reply
[+] [-] hadley|9 years ago|reply
However, the modelling infrastructure in R is generally showing its age, and thinking about how to make modelling easier is something that I will be working on in the coming months.
[+] [-] Rekushi|9 years ago|reply
[+] [-] baldfat|9 years ago|reply
I use R exclusively for data science. Really encourage you to just give it a try. The tools, packages, community and the industry support are just awesome.
I did my reports for the end of the year and people loved the reports, but the office is so MS Office focused that they wanted them in Word and PowerPoint (UGH). R has great tools for that: RMarkdown and the ReporteRs library convinced me to switch. Also, indexing from 1 is a super strong selling point from now on for doing data science. http://davidgohel.github.io/ReporteRs/
[+] [-] danso|9 years ago|reply
- via Wes McKinney, creator of pandas (which makes Python about as close to R as you can get): https://www.amazon.com/Python-Data-Analysis-Wrangling-IPytho...
- http://joelgrus.com/2015/04/26/data-science-from-scratch-fir...
There are a bunch of books specific to machine learning too, though I haven't read them myself.