Once I got serious about this stuff, I started using Emacs with Org Mode and ESS. It has most of the features listed here: it's just a text file, magit is incredible, you can export to HTML (with lots of other formats available as ox plugins, e.g. MediaWiki), LaTeX support is excellent, Org Mode is incredible for organizing large analyses and managing todos, and you can glue together anything from anywhere (R, Python, shell scripts, Spark clusters, SQL, remote processes, etc.). Since I code in addition to doing data analysis, I can reuse all my coding tooling for analysis too. I can take notes in Org Mode during a meeting, then afterwards run some analysis directly on those meeting notes, export to HTML/PDF, and send it to a colleague.
It is missing the quick backtick interpolation that you get with R-Markdown, and some of the nice UI stuff like inline Shiny graphs and clickable tables (but it is easy to output Org-format tables that can be piped into other languages).
The disadvantage is collaboration; I don't know of anyone who uses a non-Emacs Org implementation, and adopting an editor/religion/lifestyle just to collaborate on a project is a huge ask. I find that in Emacs, ESS + polymode works great for R Markdown files, providing many of the same benefits in a file that RStudio can also interpret.
Reproducible research must be reproducible by the unenlightened, after all!
Why do many say 'serious' work gets done in Python? R is great for linear models, but I find it tedious for many other things, such as machine learning. However, I wouldn't classify that as 'serious'; I just find each one performs better for different tasks.
It's already been said, but I do a lot of NLP. R handles text poorly, and humans use a lot of text.
TensorFlow, neural networks, etc. are better in Python.
Between pandas, list comprehensions, Python's collections library, scikit-learn, and Spyder, I feel I have a lot of power at my fingertips, and it's easy to do most of the machine learning I want.
Importing a package takes a meaningful amount of time in R; several seconds is just unacceptable.
It's a personal matter, but R has syntax that gets on my nerves. A Python list is a = [1,2,3]; in R it's a = c(1,2,3). Perhaps it's because I used other languages before, but my fingers are more adept at hitting [, which requires no Shift, than (. Some people love curly braces and lots of parentheses in if/for statements; I appreciate them not being there.
I have to fight with R over scientific notation, always copy-pasting options(scipen=999) into my code.
That said, Spyder is buggy, and RStudio is fantastic. I still haven't come across a Python IDE that is on par with RStudio.
Edit: I forgot to say, I feel PySpark is far superior to SparkR. Last I checked, SparkR only worked with a very old version of Spark; I don't think that version is even supported anymore. This is a bit of a big deal to me.
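As a minimal illustration of the point above about Python's text handling with the standard collections library and list comprehensions, a few lines go a long way. top_tokens is a made-up helper for this sketch, not part of any library:

```python
from collections import Counter

def top_tokens(text, n=3):
    """Lowercase, strip simple punctuation, and count word frequencies."""
    tokens = [t.strip(".,!?").lower() for t in text.split()]
    return Counter(tokens).most_common(n)

# Most frequent tokens, ties broken by first appearance.
print(top_tokens("the cat sat on the mat, and the cat slept"))
```

The same job in base R tends to involve more ceremony (strsplit returning lists, table, sort), which is the kind of friction being described.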
I'm not a Python or R dev (I'm actually a Scala dev whose day job consists substantially of refactoring Python and R code written by data scientists and data analysts to run on the JVM on prod servers). Sure, Python is easier to grok for upstream ETL/data processing, but that's commodity work (or it should be, anyway) and not a solid basis for comparing R vs. Python. R has far more packages than the "scientific" portion of PyPI, and for certain domains the quality (and quantity) of R's packages makes it the better choice: signal processing (or any more-than-routine time-series analysis), seismic interpretation, finance, experimental design, chemometrics, etc. And with strict use of the data.table package (coercing data frames to data.tables and using that package's syntax to manipulate your data), R is very fast. Ignore the smack and leave those folks to their "serious" work.
"No cell block output is ever truncated. Accidentally printing an entire 100,000+ row table to a Jupyter Notebook is a mistake you only make once." Hah, sadly this is not the case for me.
Jupyter now has output rate limitations (I pushed for them to add them), though I think they may be off by default. I also implemented something better (I think), which is an output buffer, for CoCalc's version of collaborative Jupyter notebooks. Instead of rate limiting, CoCalc saves only the last part of the output, discarding earlier output, and provides a link to get it.
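The "save only the last part of the output" idea described above can be sketched in a few lines of Python. TailBuffer here is a hypothetical illustration of the technique, not CoCalc's actual implementation:

```python
from collections import deque

class TailBuffer:
    """Keep only the last max_lines lines of output, counting what was dropped."""

    def __init__(self, max_lines=1000):
        self.lines = deque(maxlen=max_lines)  # old lines fall off the front
        self.dropped = 0

    def write(self, line):
        # deque with maxlen silently discards the oldest entry when full;
        # track how many were lost so we can tell the user.
        if len(self.lines) == self.lines.maxlen:
            self.dropped += 1
        self.lines.append(line)

    def render(self):
        header = []
        if self.dropped:
            header = [f"... {self.dropped} earlier lines discarded ..."]
        return "\n".join(header + list(self.lines))

# Simulate a cell that prints far more than we want to keep.
buf = TailBuffer(max_lines=3)
for i in range(10):
    buf.write(f"row {i}")
print(buf.render())
```

Compared to hard rate limiting, this keeps the tail of the output (often the part with the error or final result) and replaces the rest with a note, which matches the behavior described for CoCalc.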
One of the best things about R is the mailing list. At least when I was learning stats and R, the knowledge, both of how to do things in R and of what math to use when, was phenomenal. If gentle people didn't answer in time, Prof. Brian Ripley from Oxford would answer early morning British time and explain why your question was wrong, what the math meant, what you really meant to ask and why, and then give three lines of R to do it.
Can I say how damn good notebooks (any notebooks) are for data exposition compared to traditional coding environments?
I'm more familiar with Jupyter than R Notebooks. I'd second the point about version control in Jupyter being hard; there isn't really a good pattern for it yet.
I would note that I believe the latest version of Jupyter has prettier tables though!
Edit: Also, matplotlib makes me sad. Surely there could be something better which abandons it completely?
For statistical programming, since we're talking about R, I strongly recommend R for Data Science (http://r4ds.had.co.nz) by Hadley Wickham, who created a large share of the most commonly used R packages (the tidyverse) and, incidentally, now works for RStudio.
A good book on statistical theory is harder to come by, though.
Each field has their own "good practical statistics book". I work in finance and so recommend Fabozzi. It's good, but so are many other foundational texts. Your requirement for practicality necessarily negates a one true answer.
It will depend a lot on your field, but a solid grasp of fundamental probability theory should be applicable everywhere.
I think this is an excellent overview (http://www.math.uah.edu/stat/index.html). Learning probability from a measure-theory angle is more difficult to grok than the frequentist approach everyone is more familiar with, but I found it much more enjoyable. (I learnt the usual way during a computer science undergrad, but am now re-doing it more rigorously for a master's in financial engineering.)
What's your background, and what exactly do you mean by modern methods? An Introduction to Statistical Learning is good, and you can download the PDF: http://www-bcf.usc.edu/~gareth/ISL/ (it assumes a pretty decent background in mathematics, though).
Can someone explain in what sense these notebooks are "reproducible" to a greater extent than just a .py or R file? I'm not that familiar with them. Do they carry key metadata or something?
Writing a bunch of scripts can quickly become a mess. I was working on some twitter analysis for a project, and not really worrying about the code because I didn't intend for it to be used again, and it quickly became a mess of "run this script, then run that script on the generated file, then use this shell command to process the file, then run the final analysis step on that file, then clean up all the intermediates". Not to mention, say, "one-time" data cleanup through the shell / REPL that runs into problems months down the line when you want to update the data set. And, of course, invariably none of this is documented. Notebooks don't force you to organize your code and write documentation, but they strongly encourage it.
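One way to avoid the "run this script, then run that script on the generated file" mess described above is to record every step as a named function in a single file, so the whole chain is documented and re-runnable from scratch. All the names below are hypothetical stand-ins for the kind of steps involved:

```python
def fetch(raw):
    # Stand-in for the collection/download step.
    return [r.strip() for r in raw]

def clean(rows):
    # The "one-time" cleanup, now recorded as code
    # instead of living only in shell history.
    return [r.lower() for r in rows if r]

def analyze(rows):
    # Final analysis step; here just a row count.
    return {"n_rows": len(rows)}

def pipeline(raw):
    # The whole chain in one place, in order.
    return analyze(clean(fetch(raw)))

print(pipeline(["  Hello ", "", "World  "]))
```

Notebooks encourage exactly this shape implicitly, since cells run top to bottom and the intermediate state is visible; a plain script can get the same benefit, but only if you impose the structure yourself.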
The team working on Spyder (the closest Python alternative to RStudio) has had something like R Notebooks on their roadmap for a while now, but it keeps being pushed into the future.
I wish they could use RStudio for a while and understand just how important that feature is for someone using Python for research.
I have tried R in Jupyter a few times and it was nice, but the advantages of R Notebooks are just awesome. Git playing nice is the best advantage.
I am still clueless about the religious Python vs. R fight and the smack talk that "serious" work is done in Python. R works best for me.
The answer is always: math
You have other options, like Bokeh and Plotly.
Follow it up with The Elements of Statistical Learning, by three of the same authors, for more advanced material.