For anyone interested in R without a background in statistics: I would highly recommend learning the two in parallel (if not statistics first).
R is first and foremost a language for statistical computing. You really aren't going to get much out of it without working on some interesting data/stats problems. Plus, for most hacker types, I think being able to play with the statistics you're learning about in R can be a great learning aid.
However, not only is it beneficial to learn stats with R, it is imho dangerous to learn R without some stats. There's already too much research being published where 'p-value' means "the thing that t.test() output that I was told needs to be in the paper".
Because R lets you play so freely with stats I find it a great tool to gain greater intuition about certain mathematical principles, but there is a temptation to let the tool do the work and the thinking for you.
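To make the "play with the stats" point concrete, here is a minimal sketch in base R. The seed, sample sizes and number of simulations are arbitrary illustrative choices, not anything from the course. It runs many t-tests on data with no real effect and counts how often p < 0.05 shows up anyway.

```r
# Simulate many two-sample t-tests where the null hypothesis is TRUE:
# both samples are drawn from the same normal distribution.
set.seed(42)     # arbitrary seed, only for reproducibility

n_sims <- 2000   # number of simulated experiments (illustrative choice)
n_obs  <- 20     # observations per group (illustrative choice)

p_values <- replicate(n_sims, t.test(rnorm(n_obs), rnorm(n_obs))$p.value)

# With no real effect, p-values are roughly uniform on [0, 1], so about
# 5% of them fall below 0.05 purely by chance.
false_positive_rate <- mean(p_values < 0.05)
print(false_positive_rate)
```

Running hist(p_values) afterwards should show a roughly flat histogram, which is a decent way to see that a small p-value on its own is not evidence of much.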
This is a very, very good point. Though, many of the functions that R provides just won't make any sense at all if you don't have an intuition for the statistics behind them. I have found myself reading the papers published about specific functions in order to understand the results.
Do you have any resources that you suggest for beginners in statistics looking to learn on their own?
How about not-beginners looking to refresh / deepen their intuitions?
I've recently been working with the Python toolset in this space -- pandas, numpy, matplotlib -- and have run smack dab into my rusty regression analysis. In particular, I need to better understand the assumptions underlying the error distribution and the variances around the coefficient and intercept values.
Any suggestions for some deeper study / refresher?
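Not a substitute for a proper refresher, but a small base R sketch can show where those quantities live in a fitted model. The data here is made up, and the true intercept, slope and noise level are assumptions chosen only for illustration.

```r
set.seed(1)  # arbitrary seed for reproducibility

# Made-up data: a linear signal plus independent normal errors with
# constant variance (the classical OLS assumptions).
x <- runif(100, 0, 10)
y <- 2 + 0.5 * x + rnorm(100, sd = 1)

fit <- lm(y ~ x)

# Estimates, standard errors, t statistics and p-values for the
# intercept and the slope.
coef_table <- summary(fit)$coefficients
print(coef_table)

# The standard errors are the square roots of the diagonal of the
# estimated covariance matrix of the coefficients.
std_errors <- sqrt(diag(vcov(fit)))

# 95% confidence intervals for intercept and slope.
print(confint(fit))

# The residuals are what the error assumptions get checked against.
res <- residuals(fit)
```

From here, qqnorm(res) and plot(fitted(fit), res) are the usual first looks at normality and constant variance of the errors.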
While it is predominantly a statistics language, there is also a huge wealth of data manipulation capabilities in packages and functions like plyr, aggregate, *apply, ave, subset, etc.
Just in terms of organizing data sets, ignoring any statistical analysis, R is fantastic.
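For a flavour of what that looks like, here is a minimal sketch using a few of the base functions named above (subset, aggregate, tapply, ave). The data frame and its column names are made up for illustration.

```r
# Made-up data frame for illustration only.
df <- data.frame(
  group = c("a", "a", "b", "b", "b"),
  value = c(1, 3, 10, 20, 30)
)

# Keep only the rows that satisfy a condition.
big <- subset(df, value > 2)

# Mean of `value` within each group, returned as a data frame.
means_df <- aggregate(value ~ group, data = df, FUN = mean)

# The same summary as a named vector-like result.
means_vec <- tapply(df$value, df$group, mean)

# ave() returns one value per row: each row's own group mean.
df$group_mean <- ave(df$value, df$group)

print(means_df)
```

plyr covers the same split-apply-combine ground with a more uniform interface, but it is an add-on package rather than part of base R.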
This is great. I've nearly completed a class at UC Berkeley which was almost entirely in R and I can say with certainty that it is a marvelous language. It is powerful, concise, and has an incredibly robust community. I've experimented with many programming languages, but I have not used one which allows you to experiment as rapidly as R.
I'm currently going through the Codeschool lessons to see if there is anything that I may have missed in my class. So far so good!
Edit: The most important thing that I didn't see covered in the course was RStudio. Considering that R is more of a scripting language than a programming language, I've found that RStudio is instrumental in using the language to its full capacity. While it's certainly possible, and in some cases optimal, to use R from the command line, my experience is that the GUI features of RStudio are incredibly powerful. The ability to browse data frames and have graphs show up in the context of your work has been very useful for exploring and understanding data. Otherwise, the course does a pretty decent job introducing readers to the language and its data structures.
I'm just going through this and while I love the concept, there is one criticism from me. The course is very comfortable to walk through, but it doesn't make use of the benefits of the online environment.
It is just a "this is how it goes, now type it" kind of teaching. I am almost done with the second session, and have most likely forgotten most of the content from the first. If anybody else is going to try teaching stuff in a similar way, please let me play with and try out stuff as early as possible. Even if the exercises are completely pointless, please make them a bit harder than just exchanging a "+" for a "-". It feels so pointless having such a great learning environment and not using it to make it feel less of a brain-dump process.
Same here. And sometimes after being introduced to a concept or function, I will wonder, "hmm, what if I do this..." but the embedded REPL doesn't allow you to tinker.
Thanks. I was sorely disappointed yesterday when I found out that Coursera classes follow a strict schedule (and that I couldn't look at the material right then) and that I wouldn't be able to try the Data Analysis with R course on their site.
I feel like it starts off a little bit on the wrong foot by introducing basic types as scalar variables. In reality R has no scalar variables: everything is a vector or a list, and what looks like a scalar is just a vector of length one, e.g.:
> is.vector("a")
[1] TRUE
This might seem like nitpicking, but it leads to a world of confusion when programmers used to languages with scalars start trying to use R that way. It took me several months of confusion and weird bugs before it finally clicked and I started understanding R better.
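A minimal sketch of the point, in base R only. The variable names are illustrative, and these are just a few behaviours that may trip people up, not necessarily the specific bugs described above.

```r
x <- 5   # looks like a scalar, but is a numeric vector of length one

# A "scalar" and a one-element vector built with c() are identical.
same <- identical(x, c(5))

# Indexing works on it like on any other vector ...
first <- x[1]

# ... and indexing past the end returns NA instead of raising an error.
missing_elem <- x[2]

# Functions are vectorised: they return one result per element, so code
# written with a single value in mind silently accepts several.
lens <- nchar(c("a", "bc", "def"))

# length() counts elements, not characters, which often surprises
# people coming from languages with a dedicated string type.
n_elements <- length("hello")
n_chars    <- nchar("hello")

print(c(n_elements = n_elements, n_chars = n_chars))
```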
Can you give an example where the confusion between a scalar and a length one vector is important? I'm trying to figure out how to better teach R to people familiar with other languages and understanding your stumbling blocks would be v. helpful.
I really like this, but one big complaint is that the auto-scroll after completing a little task isn't correct. So each time after I "pass" a particular section, I have to manually scroll down with my trackpad.
R doesn't seem to get much front-page love on HN, or if it does, I haven't seen it. What would people suggest is the technology for statistics going forward? I really hoped it would be built around Clojure (e.g. Incanter[1]) and not Python, for entirely selfish reasons.
The old joke is that the reason why R is awesome is that it was created by statisticians, and the reason why R sucks is that it was created by statisticians. As an every-day user of R, I can't help but think that description is perfect.
It also means that R isn't going away. It is getting more popular, and there is a ton of work on improving the runtime, which will only mitigate people's itches to move away from it. But most importantly, it has the network effects to its advantage.
Even as a Clojure lover, I can't see R ever being substituted. I see more hope for the Renjin project than I see for alternatives like Pandas/SciPy/NumPy, Julia, or Incanter.
I highly recommend supplementing a course like this (where you learn about the language's ecosystem) with the R Cookbook from O'Reilly. It's been a lifesaver for me, and helped me learn R over the course of a few months of needing it at a new job.
Now I find that I need to learn something else for data munging: R is terrible at data manipulation and querying.
The querying bit is solvable with the incredibly useful sqldf package. The package allows you to use SQL syntax to query your data.frames (by creating, populating, querying and deleting a temporary database table in the background, using SQLite by default).
Example: I have a dataframe named dfrm with columns named "id" "height" "name"
If I want the heights of all people whose names start with D, I would need to use:
> dfrm$height[which(substr(dfrm$name,1,1)=='D')]
Terse, but painful. Compare to:
> sqldf("select height from dfrm where name like 'D%'")
I actually find base R excellent for data munging and manipulation, even without using additional packages. Here is a reproducible example that very easily accomplishes what you were trying to do (first two lines just set up a sample data frame)
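The snippet that went with this comment is not preserved here. As a stand-in, here is a minimal base R sketch of the same query, with a made-up data frame; the values, and the choice of grepl() and subset(), are assumptions rather than the commenter's original code.

```r
# Set up a sample data frame (made-up values).
dfrm <- data.frame(id = 1:4, height = c(170, 182, 165, 190),
                   name = c("Dana", "Eli", "Dmitri", "Ana"), stringsAsFactors = FALSE)

# Heights of everyone whose name starts with "D".
d_heights <- dfrm$height[grepl("^D", dfrm$name)]

# The same query with subset(), which reads a bit closer to SQL.
d_rows <- subset(dfrm, grepl("^D", name), select = height)

print(d_heights)
```

Both forms stay inside ordinary R expressions, so the condition can be built from variables without any query-string pasting.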
Basic R functions like subset, transform, with(in), reshape, aggregate, {a,ma,ta,sa,va}pply, match, grep(l), by, split, table, etc. allow you to accomplish just about any data frame munging you might want. Add on the plyr, reshape2, data.table, xts/zoo packages and you're ready to tackle just about anything.
I'm not a big fan of sqldf because imo R is not supposed to act like SQL. Using sqldf in practice would require a lot of query string manipulation and takes away from the nice functional features of R.
Nevertheless, it is very easy to write incomprehensible R code. The best way to avoid this is to take one of the existing style guides (Google, Hadley Wickham's) and adopt it seriously.
I've been using R extensively for the past 12 months and have achieved a high level of comfort with the language. Now I find myself at a wall because of my lack of math and statistics background. I've taken R as far as I can, or put more properly, R has taken me as far as I can go without learning more math.
With that said, I have little reason to use R right now except for its excellent plotting ability with ggplot2. Otherwise, for data munging, wrangling, connecting to databases, doing unit testing, etc., R is a giant PITA. Better to stick to Python for that. And as I learn D3, I think I'll use R even less for visualization.
Therefore R will only be valuable to me once I can harness its power for data mining and machine learning, which is its killer feature, IMHO.
I'm constantly impressed by the high quality content that Gregg and the rest of the Code School folks put out on a regular basis and Try R is no exception. Really excellent work! Looking forward to getting through the rest of it.
So, I've started using R for some stuff I'm doing at work. I have to say that I'm basically treating it as a non visual spreadsheet. Seems everything I've used it for so far, I could have done with excel. Am I doing it wrong?
Not doing it wrong, but only using a subset of R. For instance, R has powerful data manipulation abilities that can get your data into shape for the subset of R that does what Excel does. R also has a huge library of packages that go way beyond what Excel can do, especially for statistics. Sure, you can do an OLS regression in Excel, but can you do a complicated machine learning model?
I am currently reading "The Art of R Programming" by Norman Matloff and it is a good book for R beginners. Some familiarity with basic maths and stats is obviously required though.
Was keen on trying this out. Crashed on me on the variables bit on the first page; just span and span. Happens a lot with these online interpreter things.
Oh man I really wish I had this at the beginning of the semester. I'm towards the end of a grueling stats course - difficult, and not the best professor. Each homework assignment I feel like I barely scrape by without really learning. This is the first time I've ever felt this way about school.
I was very interested in the course syllabus for 'Statistics One' by Prof. Andrew Conway. I missed the course on Coursera and now I'm unable to view the course archive. Does anyone know where I can find the lectures?
(Yes, I've googled some.)
I took that course when it was running on Coursera, and I honestly can't recommend it (in its current state, at least) to anyone looking to learn basic statistics.
It covered a lot of material, but the quality and order of coverage was very inconsistent. The first couple weeks were fine, but it felt really odd to jump from correlations and scatterplots into regression, then come back to t-tests and AOV afterwards. There were also some errors in the R code on the slides, which led to a lot of confusion on the discussion forums during the class. As a student, it didn't feel like the class's pedagogical approach was very good, and I'm now finding myself using other resources to fill in the gaps.
If you'd like to hear more about those other resources I'd gladly post a list, but they're mostly Python-centric. One that I can whole-heartedly recommend, even if you stick with Prof. Conway's class, is the set of lectures from Roger Peng's "Computing for Data Analysis" class on Coursera. The course itself isn't available at the moment, but the videos are on his YouTube channel[1]. It teaches R from a programming perspective, and you'll find the content invaluable once you start writing R code that's more complex than a couple of stats functions and a plot.
[+] [-] Homunculiheaded|13 years ago|reply
[+] [-] seanlinehan|13 years ago|reply
[+] [-] chernevik|13 years ago|reply
[+] [-] iaw|13 years ago|reply
[+] [-] kamaal|13 years ago|reply
[+] [-] seanlinehan|13 years ago|reply
[+] [-] eternalban|13 years ago|reply
Looks great. Downloading it now. (Thanks for the heads-up!)
[+] [-] wasd|13 years ago|reply
[+] [-] wallee|13 years ago|reply
[+] [-] stfu|13 years ago|reply
[+] [-] wiradikusuma|13 years ago|reply
[+] [-] thedaveoflife|13 years ago|reply
[+] [-] Tactix47|13 years ago|reply
[+] [-] mikedmiked|13 years ago|reply
[+] [-] hkmurakami|13 years ago|reply
I'll definitely check this out :).
[+] [-] joshz|13 years ago|reply
http://www.youtube.com/user/rdpeng/videos?flow=grid&view...
[+] [-] bulldog|13 years ago|reply
[+] [-] iaw|13 years ago|reply
http://www.burns-stat.com/pages/Tutor/R_inferno.pdf
[+] [-] zmmmmm|13 years ago|reply
[+] [-] hadley|13 years ago|reply
[+] [-] zkoch|13 years ago|reply
[+] [-] goldfeld|13 years ago|reply
[1]: http://incanter.org/
[+] [-] saosebastiao|13 years ago|reply
[+] [-] a_bonobo|13 years ago|reply
You can do a lot of good with scipy (especially scipy.stats), but I feel like matplotlib's plotting is still behind R.
[+] [-] tokipin|13 years ago|reply
[+] [-] sonabinu|13 years ago|reply
[+] [-] bernardom|13 years ago|reply
Much easier!
[+] [-] agentq|13 years ago|reply
[+] [-] wch|13 years ago|reply
[+] [-] id_ris|13 years ago|reply
[+] [-] hadley|13 years ago|reply
[+] [-] rjsamson|13 years ago|reply
[+] [-] taeric|13 years ago|reply
[+] [-] frankc|13 years ago|reply
[+] [-] iaw|13 years ago|reply
The nice thing about that outlook is that you can essentially automate tasks you would normally perform on a spreadsheet.
[+] [-] prashantganti|13 years ago|reply
[+] [-] scrumper|13 years ago|reply
[+] [-] prezjordan|13 years ago|reply
[+] [-] aggronn|13 years ago|reply
[+] [-] keithpeter|13 years ago|reply
They meant
I shall work through the rest in a few days, nice environment.
[+] [-] merlinsbrain|13 years ago|reply
[+] [-] tomku|13 years ago|reply
[1]: https://www.youtube.com/user/rdpeng/videos?flow=grid&vie...
[+] [-] MichaelJW|13 years ago|reply
Credit: http://www.aiqus.com/questions/38783/download-upcoming-cours...
Edit: Is this like posting a pirate link?
[+] [-] nasir|13 years ago|reply