Try R — A new online course, for free

[+] Homunculiheaded|13 years ago|reply

For anyone interested in R without a background in statistics: I would highly recommend learning the two in parallel (if not statistics first).

R is first and foremost a language for statistical computing. You really aren't going to get much out of it without working on some interesting data/stats problems. Plus for most hacker types I think being able to play with the statistics you're learning about with R can be a great learning aid.

However not only is it beneficial to learn stats with R, it is imho dangerous to learn R without some stats. There's already too much research being published where 'p-value' means "the thing that t.test() output that I was told needs to be in the paper".

Because R lets you play so freely with stats I find it a great tool to gain greater intuition about certain mathematical principles, but there is a temptation to let the tool do the work and the thinking for you.
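To make the p-value point concrete, here is a minimal simulation sketch. It is only an illustration: the sample size, the number of repetitions, and the seed are arbitrary assumptions. When the null hypothesis is actually true, t.test() should still report p < 0.05 in roughly 5% of runs:

```r
# Simulate many experiments in which the null hypothesis (mean = 0) is true.
set.seed(42)                      # arbitrary seed, only for reproducibility
n_experiments <- 2000             # assumed number of repeated experiments
sample_size   <- 30               # assumed observations per experiment

p_values <- replicate(n_experiments, {
  x <- rnorm(sample_size, mean = 0, sd = 1)  # data drawn under the null
  t.test(x, mu = 0)$p.value                  # one-sample t-test against the true mean
})

# Share of "significant" results even though there is no effect at all.
false_positive_rate <- mean(p_values < 0.05)
print(false_positive_rate)        # should land close to 0.05
```

Playing with sample_size or shifting the mean in rnorm() is a quick way to see how the share of small p-values moves once a real effect exists.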

[+] seanlinehan|13 years ago|reply

This is a very, very good point. Though, many of the functions that R provides just won't make any sense at all if you don't have an intuition for the statistics behind them. I have found myself reading the papers published about specific functions in order to understand the results.

Do you have any resources that you suggest for beginners in statistics looking to learn on their own?

[+] chernevik|13 years ago|reply

How about not-beginners looking to refresh / deepen their intuitions?

I've recently been working with the Python toolset in this space -- pandas, numpy, matplotlib -- and run smack dab into my rusty regression analysis. In particular I need to better understand the distribution assumptions underlying the error distributions and the variances around the coefficient and intercept values.

Any suggestions for some deeper study / refresher?
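For reference, a minimal R sketch of where those quantities show up in a fitted model. The data is simulated, so the numbers themselves mean nothing, and the normal, constant-variance errors are an assumption built into the simulation:

```r
# Fit a simple linear regression on simulated data (all values are made up).
set.seed(1)
x <- runif(100, min = 0, max = 10)
y <- 2 + 0.5 * x + rnorm(100, sd = 1)   # errors assumed i.i.d. normal

fit <- lm(y ~ x)

# Estimated variance-covariance matrix of the intercept and slope.
coef_vcov <- vcov(fit)

# Standard errors are the square roots of its diagonal;
# they match the "Std. Error" column of summary(fit).
std_errors <- sqrt(diag(coef_vcov))
print(std_errors)
print(summary(fit)$coefficients)

# Residuals are where the error-distribution assumptions get checked.
res <- residuals(fit)
print(mean(res))   # essentially zero for least squares with an intercept
```

Plotting res against fitted(fit), or calling qqnorm(res), is the usual next step for eyeballing the constant-variance and normality assumptions.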

[+] iaw|13 years ago|reply

While it is predominantly a statistics language, there is also a huge wealth of data manipulation capabilities in tools like plyr, aggregate, *apply, ave, subset, etc.

Just in terms of organizing data sets, ignoring any statistical analysis, R is fantastic.
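A small sketch of three of those base functions on a toy data frame. The column names and values are invented purely for illustration:

```r
# Toy data set; the column names and values are invented for illustration.
sales <- data.frame(
  region = c("east", "east", "west", "west", "west"),
  units  = c(10, 20, 5, 15, 25),
  stringsAsFactors = FALSE
)

# aggregate(): one row per group, much like a pivot table.
totals <- aggregate(units ~ region, data = sales, FUN = sum)

# ave(): the group mean repeated for every row, handy for new columns.
sales$region_mean <- ave(sales$units, sales$region)

# subset(): filter rows and pick columns in one call.
big_west <- subset(sales, region == "west" & units > 10, select = units)

print(totals)
print(sales)
print(big_west)
```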

[+] kamaal|13 years ago|reply

Can you advise on good resources to learn statistics?

[+] seanlinehan|13 years ago|reply

This is great. I've nearly completed a class at UC Berkeley which was almost entirely in R and I can say with certainty that it is a marvelous language. It is powerful, concise, and has an incredibly robust community. I've experimented with many programming languages, but I have not used one which allows you to experiment as rapidly as R.

I'm currently going through the Codeschool lessons to see if there is anything that I may have missed in my class. So far so good!

Edit: The most important thing that I didn't see covered in the course was RStudio. Considering that R is more of a scripting language than a programming language, I've found that RStudio is instrumental in using the language to its full capacity. While it's certainly possible, and in some cases optimal, to use R from the command line, my experience is that the GUI features of RStudio are incredibly powerful. The ability to browse data frames and have graphs show up in the context of your work has been very useful for exploring and understanding data. Otherwise, the course does a pretty decent job introducing readers to the language and its data structures.

[+] eternalban|13 years ago|reply

> .. RStudio ..

Looks great. Downloading it now. (Thanks for the heads-up!)

[+] wasd|13 years ago|reply

Stat 133 is great. I took it with Spector and he was phenomenal.

[+] wallee|13 years ago|reply

Revolution R is also really nice to develop in. I believe that you can get an academic version for free.

[+] stfu|13 years ago|reply

I'm just going through this and while I love the concept, there is one criticism from me. The course is very comfortable to walk through, but it doesn't make use of the benefits of the online environment.

It is just a "this is how it goes, now type it" kind of teaching. I am almost done with the second session, and have most likely forgotten most of the content from the first. If anybody else is going to try teaching stuff in a similar way, please let me play with and try out stuff as early as possible. Even if the exercises are completely pointless, please make them a bit harder than just exchanging a "+" for a "-". It feels so pointless having such a great learning environment and not using it to make it feel less of a brain-dump process.

[+] wiradikusuma|13 years ago|reply

Same here. And sometimes after being introduced to a concept or function, I will wonder, "hmm, what if i do this.." but the embedded REPL doesn't allow you to tinker.

[+] thedaveoflife|13 years ago|reply

For those learning R, this site: http://www.twotorials.com/ which I found on HN several months ago is fairly helpful as well.

[+] Tactix47|13 years ago|reply

Great link, thanks so much for sharing! Looks like a very complete introduction to working with R.

[+] mikedmiked|13 years ago|reply

Yes! I thought I would never find this again! Thank you so much!

[+] hkmurakami|13 years ago|reply

Thanks. I was sorely disappointed yesterday when I found out that Coursera classes follow a strict schedule (and that I couldn't look at the material right then) and that I wouldn't be able to try the Data Analysis with R course on their site.

I'll definitely check this out :).

[+] joshz|13 years ago|reply

Videos are available on Roger Peng's Youtube.

http://www.youtube.com/user/rdpeng/videos?flow=grid&view...

[+] bulldog|13 years ago|reply

https://class.coursera.org/compdata-2012-001/class/index

[+] iaw|13 years ago|reply

For those who want to go deep :

http://www.burns-stat.com/pages/Tutor/R_inferno.pdf

[+] zmmmmm|13 years ago|reply

I feel like it starts off a little bit on the wrong foot by introducing basic types as scalar variables. In reality R has no scalar variables: everything is a vector or list, and what looks like a scalar is really a vector of length one, e.g.:

    > is.vector("a")
    [1] TRUE

This might seem like nitpicking but it leads to a world of confusion when programmers used to languages with scalars start trying to use R that way and it took me several months of confusion and weird bugs before I finally clicked and started understanding R better.
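A few more lines in the same spirit, as a rough sketch of the behaviour (the variable names are just placeholders):

```r
# A "scalar" literal is already a vector of length one.
x <- 5
print(is.vector(x))        # TRUE
print(length(x))           # 1
print(identical(x, c(5)))  # TRUE: c() around a single value changes nothing

# Operators work element-wise and recycle the shorter operand,
# which surprises people expecting scalar semantics.
recycled <- c(1, 2, 3, 4) + c(10, 20)
print(recycled)            # 11 22 13 24

# Comparisons return one result per element, not a single TRUE/FALSE.
matches <- c("a", "b", "a") == "a"
print(matches)             # TRUE FALSE TRUE
```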

[+] hadley|13 years ago|reply

Can you give an example where the confusion between a scalar and a length one vector is important? I'm trying to figure out how to better teach R to people familiar with other languages and understanding your stumbling blocks would be v. helpful.

[+] zkoch|13 years ago|reply

I really like this, but one big complaint is that the auto-scroll after completing a little task isn't correct. So each time after I "pass" a particular section, I have to manually scroll down with my trackpad.

[+] goldfeld|13 years ago|reply

R doesn't seem to get much frontpage love on HN, or even if it does and I haven't seen, what would people suggest is the technology for statistics going forward? I really hoped it would be around Clojure (e.g. Incanter[1]) and not Python, for entirely selfish reasons.

[1]: http://incanter.org/

[+] saosebastiao|13 years ago|reply

The old joke is that the reason why R is awesome is that it was created by statisticians, and the reason why R sucks is that it was created by statisticians. As an every-day user of R, I can't help but think that description is perfect.

It also means that R isn't going away. It is getting more popular, and there is a ton of work on improving the runtime, which will only mitigate people's itches to move away from it. But most importantly, it has the network effects to its advantage.

Even as a Clojure lover, I can't see R ever being substituted. I see more hope for the Renjin project than I see for alternatives like Pandas/SciPy/NumPy, Julia, or Incanter.

[+] a_bonobo|13 years ago|reply

There used to be a lot of hype on HN around the new statistical language Julia; not sure how far along and how good that thing actually is.

You can do a lot of good with scipy (especially scipy.stats), but I feel like matplotlib's plotting is still behind R.

[+] tokipin|13 years ago|reply

honestly i don't know of anything that can compare to Mathematica, besides its $300 price tag

[+] sonabinu|13 years ago|reply

A nytimes.com article on R outlining its history and how it is moving from academia to mainstream data analytics: http://www.nytimes.com/2009/01/07/technology/business-comput...

[+] bernardom|13 years ago|reply

Beautiful website.

I highly recommend supplementing a course like this (where you learn about the language's ecosystem) with the R Cookbook from O'Reilly. It's been a lifesaver for me, and helped me learn R over the course of a few months of needing it at a new job.

Now I find that I need to learn something else for data munging- R is terrible at data manipulation and querying.

The querying bit is solvable with the incredibly useful sqldf package (hosted on Google Code). The package allows you to use SQL syntax to query your data.frames (by creating, populating, querying and deleting a temporary database table in the background, SQLite by default).

Example: I have a dataframe named dfrm with columns named "id" "height" "name"

If I want the heights of all people whose names start with D, I would need to use:

> dfrm$height[which(substr(dfrm$name,1,1)=='D')]

Terse, but painful. Compare to:

> sqldf("select height from dfrm where name like 'D%'")

Much easier!

[+] agentq|13 years ago|reply

I actually find base R excellent for data munging and manipulation, even without using additional packages. Here is a reproducible example that very easily accomplishes what you were trying to do (first two lines just set up a sample data frame)

    set.seed(123)
    dfrm <- data.frame(height = runif(20),
                       name = paste(sample(LETTERS[1:5], 20, replace = TRUE), letters[1:20]))
    subset(dfrm, grepl('^D', name), select = height)

Basic R functions like subset, transform, with(in), reshape, aggregate, {a,ma,ta,sa,va}pply, match, grep(l), by, split, table, etc. allow you to accomplish just about any data frame munging you might want. Add on the plyr, reshape2, data.table, xts/zoo packages and you're ready to tackle just about anything.

I'm not a big fan of sqldf because imo R is not supposed to act like SQL. Using sqldf in practice would require a lot of query string manipulation and takes away from the nice functional features of R.

Nevertheless, it is very easy to write incomprehensible R code. The best way to avoid this is to take one of the existing style guides (Google, Hadley Wickham's) and adopt it seriously.
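To illustrate the point about query strings, here is a rough sketch. The helper names heights_for_prefix and query_for_prefix are made up for this example, and the SQL string is only built, not executed, so nothing here depends on sqldf being installed:

```r
# Same sample data as above, rebuilt so this snippet runs on its own.
set.seed(123)
dfrm <- data.frame(height = runif(20),
                   name = paste(sample(LETTERS[1:5], 20, replace = TRUE), letters[1:20]),
                   stringsAsFactors = FALSE)

# Functional style: the prefix is an ordinary argument (hypothetical helper).
heights_for_prefix <- function(df, prefix) {
  df[startsWith(df$name, prefix), "height", drop = FALSE]
}

# SQL style: the prefix has to be spliced into a query string by hand.
# Only the string is built here; passing it to sqldf() is left out.
query_for_prefix <- function(prefix) {
  sprintf("select height from dfrm where name like '%s%%'", prefix)
}

print(heights_for_prefix(dfrm, "D"))
print(query_for_prefix("D"))
```

The first helper composes with the rest of R like any other function, while the second needs care with quoting and escaping as soon as the prefix comes from user input.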

[+] wch|13 years ago|reply

In your R version, you don't need the call to `which()`, so you could do this instead:

    dfrm$height[substr(dfrm$name,1,1) == "D"]

And here's a much clearer way to do it:

    subset(dfrm, grepl("^D", name), select = height)

[+] id_ris|13 years ago|reply

I've been using R extensively for the past 12 months and have achieved a high level of comfort with the language. Now I find myself at a wall because of my lack of math and statistics background. I've taken R as far as I can, or put more properly, R has taken me as far as I can go without learning more math.

With that said, I have little reason to use R right now except for its excellent plotting ability with ggplot2. Otherwise for data munging, wrangling, connecting to databases, doing unit testing, etc., R is a giant PITA. Better to stick to Python for that. And as I learn D3, I think I'll use R even less for visualization.

Therefore R will only be valuable to me once I can harness its power for data mining and machine learning, which is its killer feature, IMHO.

[+] hadley|13 years ago|reply

Would love to hear what you find most painful about data munging/wrangling and unit testing. It's something that I've been trying to improve in R (e.g. http://vita.had.co.nz/papers/tidy-data.html and http://journal.r-project.org/archive/2011-1/RJournal_2011-1_...)

[+] rjsamson|13 years ago|reply

I'm constantly impressed by the high quality content that Gregg and the rest of the Code School folks put out on a regular basis and Try R is no exception. Really excellent work! Looking forward to getting through the rest of it.

[+] taeric|13 years ago|reply

So, I've started using R for some stuff I'm doing at work. I have to say that I'm basically treating it as a non-visual spreadsheet. Seems everything I've used it for so far, I could have done with Excel. Am I doing it wrong?

[+] frankc|13 years ago|reply

Not doing it wrong, but only using a subset of R. For instance, R has powerful data manipulation abilities that can get your data into shape for the subset of R that does what Excel does. R also has a huge library of packages that go way beyond what Excel can do, especially for statistics. Sure, you can do an OLS regression in Excel, but can you do a complicated machine learning model?

[+] iaw|13 years ago|reply

Nope, that's one way to look at it (especially if you're sticking with data frames).

The nice thing about that outlook is that you can essentially automate tasks you would normally perform on a spreadsheet.

[+] prashantganti|13 years ago|reply

I am currently reading "The Art of R Programming" by Norman Matloff and it is a good book for R beginners. Some familiarity with basic maths and stats is obviously required though.

[+] scrumper|13 years ago|reply

Was keen on trying this out. Crashed on me on the variables bit on the first page; just spun and spun. Happens a lot with these online interpreter things.

[+] prezjordan|13 years ago|reply

Oh man I really wish I had this at the beginning of the semester. I'm towards the end of a grueling stats course - difficult, and not the best professor. Each homework assignment I feel like I barely scrape by without really learning. This is the first time I've ever felt this way about school.

[+] aggronn|13 years ago|reply

This has been a very common theme through my undergraduate stats education.

[+] keithpeter|13 years ago|reply

Section 2.1 contains the following instruction

   "Try creating a vector of numbers, like this:"

So I typed c(5, 9, 11) and got an error message.

They meant

   "Type the R code to create this vector:"

I shall work through the rest in a few days, nice environment.

[+] merlinsbrain|13 years ago|reply

I was very interested in the course syllabus for 'Statistics One' by Prof. Andrew Conway. I missed the course on Coursera and now I'm unable to view the course archive. Does anyone know where I can find the lectures? (Yes, I've googled some.)

[+] tomku|13 years ago|reply

I took that course when it was running on Coursera, and I honestly can't recommend it (in its current state, at least) to anyone looking to learn basic statistics.

It covered a lot of material, but the quality and order of coverage was very inconsistent. The first couple weeks were fine, but it felt really odd to jump from correlations and scatterplots into regression, then come back to t-tests and AOV afterwards. There were also some errors in the R code on the slides, which led to a lot of confusion on the discussion forums during the class. As a student, it didn't feel like the class's pedagogical approach was very good, and I'm now finding myself using other resources to fill in the gaps.

If you'd like to hear more about those other resources I'd gladly post a list, but they're mostly Python-centric. One that I can whole-heartedly recommend even if you stick with Prof. Conway's class is the set of lectures from Roger Peng's "Computing for Data Analysis" class on Coursera. The course itself isn't available at the moment, but the videos are on his Youtube channel[1]. It teaches R from a programming perspective, and you'll find the content invaluable once you start writing R code that's more complex than a couple stats functions and a plot.

[1]: https://www.youtube.com/user/rdpeng/videos?flow=grid&vie...

[+] MichaelJW|13 years ago|reply

Take a look here: http://www.universalsubtitles.org/en/teams/coursera/?project...

Credit: http://www.aiqus.com/questions/38783/download-upcoming-cours...

Edit: Is this like posting a pirate link?

[+] nasir|13 years ago|reply

I did the statistical work for my master's thesis in R. It was a pain at the beginning because I was not very good with stats, so learn the stats alongside.
