I do most data engineering in R (RMarkdown workbooks) and most software engineering in Python/Django. It took three separate, dedicated attempts to get warm with R (pre-tidyverse, showing my age); now I'm interrupting work on an RShiny app to write this comment. The ecosystem around the tidyverse helps immensely in converting my colleagues' workflows from Excel to R. Clarity and simplicity win over purity here (you may now light your pitchforks). And NSE still breaks my brain.
I think the community that R users have managed to create is its strength. Python has a community of software developers, while R has a community of people who use programming to solve their problems. It's a fundamental difference in mindset, one which really shapes the community.
What also helps is that R is so focused on data and statistics. That focus really helps users when it comes to finding help. Python is famously second-best at everything, but that also means its community is spread more thinly over more subjects.
I too got started pre-tidyverse. I've done some minor analysis in the last two years and been blown away by how easy it is to get up and running with that code. Way easier than it used to be. I actually bumbled my way through building a simple report-building system pre-1.0.0. It was horrible in comparison.
I've gotten some adoption of RStudio at two companies now. It's amazing for exploratory analysis, and its cloud capabilities are wonderful.
- Python as a text-processing language has been less convenient than Perl for a long time
- Python + Django/Flask as a web-service stack hasn't been as convenient as Ruby on Rails for some time
- Python + numpy as a numeric computation language hasn't had all the features of Matlab
- Python + pandas + matplotlib as a data science language hasn't had all the features of R
but using Python throughout, or even Python with a sprinkling of Cython/C++/C for performance, allows for cleaner and faster engineering than using a special language for each niche.
I don't think R has a bigger problem with non-programmers being bad software engineers than Python does; there are plenty of people who know Python passably and are quite happy that they can be productive without being good software engineers (versus Java, where the language is biased toward everyone writing code to a minimum quality standard rather than toward raw productivity). But you can find decent Python software engineers, and more recently you can find decent software engineers who also know enough of the niche in question to produce high-quality production code in that niche from the get-go, rather than throwing models over the data-science/engineering wall between two departments.
I think this is a very valid point. There are times when Python is not the absolute best choice for a given problem, but in most cases it can allow you to achieve the desired goal in a way that is easy for newcomers to the particular codebase to grok very quickly.
There are times when it may be better to use another language and there is nothing wrong with that. I default to Python and if I think there will be specific issues with it, then I can look at a more specialized language.
1. Native data types - this is one of the things Julia was designed to do very well. That is, native-like treatment for all data, without needing a C-family underbelly the way Python does for its high-performance code.
2. Non-standard evaluation - Julia has metaprogramming[1] and Symbols[2], which provide similar ideas in a different way. It uses abstract syntax trees and is very Lisp-like in that respect, if you want to get into writing macros and such.
3. Package management - Julia has a best-in-class, built-in package management system with versioning. Julia also has first-class support for documentation, so it's very easy for developers to write relevant documentation. As an R user before RStudio, package management was a pain; RStudio hides the manual work that used to be searching for, downloading, and unpacking packages. Packages usually work really well together, often automatically, so you can get really cool results[3] where other languages would require a lot of coordination (like the tidyverse).
4. Functional paradigm - Julia is multi-paradigm and is conducive to functional, imperative, and object-oriented styles, among others.
I'm a big Julia fan, after having gone R -> Python -> Julia. Not to make this totally one-sided: I still like R for plotting because it's more mature. RStudio also is very nice for dynamically interacting with datasets, but Juno comes pretty close there too.
1: https://docs.julialang.org/en/v1/manual/metaprogramming/# 2: https://stackoverflow.com/questions/23480722/what-is-a-symbo... 3: https://www.youtube.com/watch?v=HAEgGFqbVkA
Actually, for plotting I prefer to use PyPlot in Julia, which is based on matplotlib and is, in my opinion, very mature and complete. I tried other (more native) plotting packages like GR, Plots, and Makie, but they did not provide all the plot types I needed or were too rough around the edges.
In any case, I am looking forward to new Julia versions, which should address the delay in plotting (as far as I know).
As an applied economist, I can assure you that R is wayyy better in terms of statistics than Python. Doing causal models in Python is a real pain because its data science community is mostly focused on machine learning. Even a robust linear model with IV or fixed effects (which is standard in causal modeling) is really hard to achieve in Python. Of course this argument is about the libraries and not the core language, but it's a strong argument to stay with R. This argument is also valid outside academia, as more and more data scientists try to address causality.
As a very frequent R user, I actually find non-standard evaluation to be more hassle than good in most situations.
If you want to program with most of the tidyverse libraries, you are forced to implement a bunch of nonsense in your function just to properly evaluate its arguments. Sure, NSE may be useful in some circumstances, but more often than not it just increases the likelihood of introducing a bug.
Especially for new programmers, NSE is a huge leap and very confusing.
The old solution was exporting underscore-suffixed functions, e.g. `mutate_()`, that used standard evaluation. And this was fine. But then RStudio decided to deprecate these functions and force NSE on users. I’m not happy about that, and I often avoid using libraries like dplyr when writing functions so that I don’t have to deal with it.
Agreed. Recent updates to rlang have made programming dplyr/ggplot2 functions a bit better, but it still feels super clunky. I use data.table for most things for programmability and speed reasons.
As much as I like ggplot2, I find the rest of the tidyverse to be solving problems it invents (e.g. quosures to fix the problem of not permitting string arguments for dplyr verbs) and monopolising an open source ecosystem.
Same. The moment I felt that my brain was melting over NSE and dplyr was the moment I started to phase out most tidyverse stuff from my work. I've switched to data.table and plain R for most of my stuff now.
I actually use R mostly because of its data.table package. It is much faster and more concise than pandas, which is a nightmare to work with. Sure, you can get the job done in pandas, but you often have to wait ~10x longer for your commands to run and sometimes, I simply cannot use pandas at all because I run out of memory.
People are usually pretty surprised when I take the stance that R is faster than Python for the things most people actually care about, which are data manipulation and model building. Python has its datatable library, which is approaching data.table's speed; however, it is very much a work in progress and does not have very useful features yet. (Benchmarks: https://h2oai.github.io/db-benchmark/)
FWIW it looks like pandas is slow/OOM-ing because the benchmarks solely use Categoricals, which aren't as heavily used by pandas users compared to R.
In particular, I suspect the benchmark sizing is forcing falling back from numpy's int64 to Python ints as categorical labels, which easily could explain a 10x or more differential.
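To make that overhead concrete, here is a stdlib-only sketch (using `array` as a stand-in for numpy's packed int64 storage, and hypothetical sizes that will vary by interpreter): each boxed Python int carries object overhead that a raw 8-byte slot does not, which is the kind of gap that can compound into a large runtime and memory differential.

```python
import sys
from array import array

n = 100_000
values = list(range(1000, 1000 + n))

# A Python list stores pointers to individually boxed int objects:
boxed_bytes = sys.getsizeof(values) + sum(sys.getsizeof(v) for v in values)

# An array of machine int64s ('q') stores raw 8-byte values:
packed = array("q", values)
packed_bytes = sys.getsizeof(packed)

# On CPython the boxed representation is typically 4x larger or more.
print(boxed_bytes / packed_bytes)
```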
If you're:
- working interactively (i.e. your code isn't part of a larger application)
- working with relatively small datasets that fit into memory
- not using any deep learning libraries
then both R and Python can do a great job and choosing one over the other is simply a matter of preference. I might even lean slightly towards R because its data frames are a bit easier to use than pandas and RStudio's REPL is the best.
But if you need to deploy your code somewhere, or high performance, or the latest deep learning libraries, then Python absolutely crushes R. And it's not even close.
By "high performance" do you mean documented FFI and friends? That should be possible with R as well.
It also seems that the actual open-source ML-community (vs. Google: we want you to use our software to ensure you can't ever own your stuff) supports R just fine: https://mxnet.apache.org/api/r
This mostly strikes me as ”I know R better than I know Python” - which is fair for deciding what to use, but is obviously not an objective comparison.
Hopefully I’ll have time tomorrow to write a rebuttal to some of his arguments. In particular, the preference for CRAN and code longevity strikes me as shortsighted.
R is objectively worse than Python for almost all data science tasks, and R is a huge PITA to productionise compared to Python.
I've yet to see any argument for R that doesn't boil down to 'well, I know it better' or 'well, I prefer the syntax'.
R is to data science as Matlab is to engineering. It's a stopgap 'non-programmer' language that thrived at a time when most academics didn't know any programming. Now schoolchildren learn programming. There is no use case for these languages anymore.
Insightful article on many fronts. I haven't had time to learn R but have been impressed by what it offers especially RStudio as an environment. I use Matlab from time-to-time and like having equivalent features that are of great help in initial data exploration and code experimentation. I haven't found anything fully equivalent in the Python world.
R was the language I learnt so many years ago. I love it, actually; I have so many fond memories. It's hard to find something better, especially for prototyping. I will admit productionizing has almost drag-and-drop simplicity in Python compared to R.
I love love love what Hadley has done with dplyr, for the most part, at least in spirit, though I think the implementation could have been done better so as not to be so clunky, especially with respect to NSE. But I think he is just trying to work within the current R ecosystem.
Which makes me ask.. is it then time for R2? (Like a Python 3). Before you shoot me.. Do we need to save the good things we have innovated from within the R ecosystem over the years and consider doing things from scratch?
Is this what Julia tried to do? I haven’t gotten around to trying it yet.
That said, I think R is always going to be there and have its place.
Frankly, if they could just make an IDE like RStudio that ran Python, I’d probably be happy enough with that. I heard that with reticulate you can run both; curious to hear of others' experience with this.
`reticulate` is a brilliant package for running Python within R. I guess it's good as long as it stays on a local machine; I've not tried productionizing code that has both R and Python. But it can work within Shiny too, helping bring Python's data science stack (esp. scikit-learn) into R.
The part on “Make code more concise” struck me as an anti-pattern that should instead be a simple function composition. Or, as the functional programmer in me notes, the author re-invented the Maybe monad.
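For the curious, a minimal Python sketch of that Maybe-style composition (the helper and names are mine, not the article's): compose functions into a pipeline that short-circuits on `None` instead of nesting checks by hand.

```python
from functools import reduce

def maybe_compose(*funcs):
    """Compose funcs left to right, short-circuiting as soon as a step yields None."""
    def bind(value, func):
        return None if value is None else func(value)
    def pipeline(value):
        return reduce(bind, funcs, value)
    return pipeline

# Example pipeline: parse a string to an int, then halve it.
parse_and_halve = maybe_compose(
    lambda s: int(s) if s.strip().isdigit() else None,
    lambda n: n / 2,
)

print(parse_and_halve("42"))    # 21.0
print(parse_and_halve("oops"))  # None (the failed parse propagates)
```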
I do like the point in “Learn the user’s language”, as friendlier error messages are something we should all strive for, although I’ve never had a particular problem with Python’s stack traces, and actually having types like Julia, or at least annotations via mypy, seems a better solution.
CRAN is a great point, and Python's packaging is in a sorry state, with a crazy number of approaches and undeclared dependencies. R does a great job here.
Functional programming section is ironic given the lack of functional patterns in the post. R has even fewer higher order functions than the python standard library.
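For reference, the stdlib higher-order toolkit that comparison leans on is the `map`/`filter`/`sorted` builtins plus `functools`; a small sample:

```python
from functools import reduce, partial

nums = [1, 2, 3, 4, 5]

# map/filter: functions as arguments
squares_of_evens = list(map(lambda n: n * n, filter(lambda n: n % 2 == 0, nums)))

# reduce: fold a binary function over a sequence
total = reduce(lambda acc, n: acc + n, nums, 0)

# partial: functions as return values with bound arguments
add10 = partial(lambda a, b: a + b, 10)

print(squares_of_evens)  # [4, 16]
print(total)             # 15
print(add10(5))          # 15
```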
It’s hard for me to see how R is better for production than Python, and the argument against pandas seems a bit of a strawman considering that numpy/scipy are quite stable and more central to the ecosystem than DataFrame. R is fantastic for data science and highly productive, until you need to do data munging or anything else that involves a general-purpose language.
> CRAN is a great point, and pythons packaging is in a sorry state with a crazy number of approaches and undeclared dependencies. R does a great job here
But for production usage, again it's a huge pain. It's difficult to keep version stability with developer machines since there's no standard lock file, and the CRAN servers often delete or silently update old versions of packages.
Packrat and Microsoft's MRAN really help, but another curious issue is that MRAN and other CRAN servers seem to have terrible stability, often going down for hours at a time (or worse).
Python's packaging is ~impossible to understand for new developers (and really absolutely needs improving), but in an organisation you just pick an approach suited for your use case.
The problem with Python conciseness is that it requires cognition to write. That cognition needs to be multiplied tenfold for understanding.
I think I'm a good developer, yet I can't understand idiomatic Python. Python could use some verbosity for the sake of everyone else. If R slows you IQ-9000 people down, please, make it standard.
R is interesting in that it has one library which has zero real competitors to the point where it becomes a justification for using the entire language: ggplot2. It's almost like ggplot2 crosses over from a package to an application, and R is the user interface. Any examples like this in other languages? I can't think of any, maybe Python & scikit-learn 5 years ago.
Can you compare ggplot2 to matplotlib? 90 seconds of googling didn't seem to indicate to me that ggplot2 is particularly different from matplotlib, either in terms of power or expressivity.
Those 90 seconds of googling constitute my entire knowledge of ggplot2.
R is the right tool in many cases but I've observed several cases where people have become overreliant on it and hacked together things that would be better written in bash, python, or other languages.
R is good for machine learning and for production. We have helped big orgs incorporate this technology into their IT ecosystems.
We used our open-source product called R Suite to manage deployment issues.
https://github.com/WLOGSolutions/RSuite
Some good points. Certain things are easier in R and often you can find code snippets that perform certain statistical tasks "the right way" and they are more concise than they would be in Python.
To me, R always felt a bit quirky.
I don't think R is much better at functional programming than Python. I found R to be limiting in terms of general programming.
Also, we now have type checking. I'm positive you could combine type checking and clever type declarations to handle application state like in Elm.
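A sketch of what that Elm-like state handling might look like with `dataclasses` and a tagged union under mypy (all the state names here are hypothetical): each state is its own immutable type, and handlers dispatch on which variant they received.

```python
from dataclasses import dataclass
from typing import Union

@dataclass(frozen=True)
class Loading:
    pass

@dataclass(frozen=True)
class Loaded:
    rows: int

@dataclass(frozen=True)
class Failed:
    reason: str

# The tagged union: application state is exactly one of these variants.
State = Union[Loading, Loaded, Failed]

def describe(state: State) -> str:
    # Dispatch on the variant; mypy narrows the type in each branch.
    if isinstance(state, Loading):
        return "loading..."
    if isinstance(state, Loaded):
        return f"loaded {state.rows} rows"
    return f"failed: {state.reason}"

print(describe(Loaded(rows=3)))  # loaded 3 rows
```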
> I don't think R is much better at functional programming than Python.
R is a functional programming language. Does Python treat functions as first-class citizens? Pass arguments by value? Store expression trees as data structures?
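For what it's worth, Python passes at least two of those three tests: functions are first-class values, and the stdlib `ast` module stores expression trees as data structures (arguments, however, are passed by object reference rather than by value, unlike R's copy-on-modify semantics).

```python
import ast

# Functions are first-class: they can be passed, stored, and returned.
def twice(f):
    return lambda x: f(f(x))

inc = lambda x: x + 1
print(twice(inc)(0))  # 2

# Expressions can be captured as syntax trees and inspected or rewritten.
tree = ast.parse("a + b * c", mode="eval")
print(type(tree.body).__name__)  # BinOp
print(ast.dump(tree.body.op))    # Add()
```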
I'm happy to have read this article as it has a different perspective on R than I am used to. My general take is that Python really is a better suited language for production workflows, while R is superb at interactive workflows. I love R, but chiefly for RMarkdown, Shiny, ggplot and obscure stats packages rather than large machine learning codebases. Here's why:
> Native data science structures.
R's data frames are often easier to use than pandas. However, production workflows often use more data types than data frames, and R is weaker there. For example:
* Lists must be accessed with `[[ ]]` instead of `[ ]`. I've seen many silent bugs slip through due to this.
* There are 3 competing implementations of classes. This results in classes being mystical and rarely understood.
* R is a Lisp-2: variables and functions may share the same name. This leads to confusing errors.
* Catching specific types of errors can be awkward.
* Adding elements to a list iteratively is slow [1]
> Non-Standard Evaluation
This can be handy while quickly working in RStudio, but it's not easy to maintain. I've seen code that failed because it specified `f(!!variable)` instead of `f(!!!variable)`. I like R's formula notation, but I'm happy enough with sklearn's API that I don't miss it.
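A rough Python analogy for that `!!` vs `!!!` trap (an analogy only, not how rlang actually works): it is the difference between passing one value and splicing a sequence into separate arguments.

```python
def f(*args):
    """Collect whatever arguments were passed."""
    return args

variable = [1, 2, 3]

# Passing the object itself, roughly dplyr's f(!!variable):
print(f(variable))   # ([1, 2, 3],)

# Splicing its elements as separate arguments, roughly f(!!!variable):
print(f(*variable))  # (1, 2, 3)
```

Confusing the two produces code that runs but does the wrong thing, which is exactly the class of bug described above.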
> The glory of CRAN
CRAN is not set up for production. It makes pinning versions very difficult [2]. Many people resort to using MRAN, a Microsoft-supported snapshot of CRAN at a specific time, so a dev can pretend to be installing software as if it were six months ago. I have seen MRAN go down multiple times [3]. Not to mention, the owner of CRAN is notoriously prickly [4], and packages will not be accepted to CRAN unless the maintainer ensures their software runs on Solaris [5]. Hadley Wickham has done so much for the community with `devtools` and his books. He gets a lot of praise, and it's not misplaced.
> Functional programming
Okay, this is actually pretty great. Hooray functional programming! Not totally related, but R has great polymorphic dispatch of functions, which really can't be overstated (the way documentation is automatically generated for it is *kisses fingers*).
Ultimately, R is a cool language. In interactive settings, I would rather work in RStudio than Jupyter any day. I like RMarkdown better than Notebooks for sharing analysis, too. If there is a specific Bayesian model necessary only available in R, that's fine, wrap it in a container. But the rest of the ETL and pipeline code feels easier to write and maintain in Python.
[1] https://stackoverflow.com/questions/17046336/here-we-go-agai... [2] https://stackoverflow.com/questions/17082341/installing-olde... [3] https://github.com/Microsoft/microsoft-r-open/issues/51 [4] https://www.reddit.com/r/rstats/comments/2t5oqp/dont_use_the... [5] http://www.deanbodenham.com/learn/r-package-submission-to-cr...
>Not totally related, but R has a great polymorphic dispatch of functions,
Dispatch in R is generally fine, but I see a great deal of UseMethod calls and switch statements on types in the libraries I've worked with. On one hand that's just users using tools badly, but on the other, R should enforce using a particular tool to solve such problems. And R is particularly bad at enforcing anything, which is why we're left with S3, S4, and R6.
There's also the FFI issue across the board for Python/R, where functions are frequently barely-cleaned-up naked FFI calls that leave it a complete mystery what's going on under the hood. I think R is generally worse at it, though: I've had memory leaks and sigterms that aren't visible in RStudio.
I do like the functional programming, though. I had an excuse to use multi.argument.Compose from the functional library recently, and it made me wish I had things like that to hand in all languages.
> R is a Lisp 2. Variables and functions may share the same name. This leads to confusing errors.
Isn't that a Lisp-1, then? Maybe I've got them backwards. CL is a Lisp-2, and it's not unusable, so either #'readmacros are good enough or there's something else going on to balance out the ambiguity.
EDIT: I see what you're saying now, it's a Lisp-2. They can share the same name at the same time, not just 1 name referring to one value or the other.
Especially rOpenSci's peer review process for R packages is fantastic (more here: https://devguide.ropensci.org/softwarereviewintro.html).