Once I got serious about this stuff, I started using Emacs with Org Mode and ESS. It has most of the features listed here: it's just a text file, magit is incredible, you can export to HTML (with lots of other formats available as ox plugins, e.g. MediaWiki), LaTeX support is excellent, Org Mode is incredible for organizing large analyses and managing todos, and you can glue together anything from anywhere (R, Python, shell scripts, Spark clusters, SQL, remote processes, etc.). Since I code in addition to doing data analysis, I can reuse all my coding tooling for analysis too. I can take notes in Org Mode during a meeting, then afterwards run some analysis directly on those meeting notes, export to HTML/PDF, and send it to a colleague.
It is missing the quick backtick interpolation that you get with R-Markdown, and some of the nice UI stuff like inline Shiny graphs and clickable tables (but it is easy to output Org-format tables that can be piped into other languages).
The disadvantage is collaboration; I don't know of anyone who uses a non-Emacs Org implementation, and adopting an editor/religion/lifestyle just to collaborate on a project is a huge ask. I find that in Emacs, ESS + polymode works great for R Markdown files, providing many of the same benefits in a file that RStudio can also interpret.
Reproducible research must be reproducible by the unenlightened, after all!
Why do many say 'serious' work gets done in Python? R is great for linear models, but I find it tedious for many other things, such as machine learning. However, I wouldn't classify that as 'serious'; I just find each one performs better for different tasks.
It's already been said, but I do a lot of NLP. R handles text poorly, and humans use a lot of text.
TensorFlow, neural networks, etc. are better in Python.
Between pandas, list comprehensions, Python's collections library, scikit-learn, and Spyder, I feel I have a lot of power at my fingertips, and it's easy to do most of the machine learning I want.
Importing a package takes a meaningful amount of time in R; several seconds is just unacceptable.
It's a personal matter, but R has syntax that gets on my nerves. A Python list is a = [1,2,3]; in R it's a = c(1,2,3). Perhaps it's because I used other languages before, but my fingers are more adept at hitting [, which requires no Shift, than (. Some people love curly braces and lots of parentheses in if/for statements; I appreciate them not being there.
I have to fight with R over scientific notation, always copy-pasting options(scipen=999) into my code.
That said, Spyder is buggy, and RStudio is fantastic. I still haven't come across a Python IDE that is on par with RStudio.
Edit: I forgot to say, I feel PySpark is far superior to SparkR. Last I checked, SparkR only worked with a very old version of Spark; I don't think that version is even supported anymore. This is a bit of a big deal to me.
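As a minimal illustration of the point above about Python's text handling with the standard collections library and list comprehensions, a few lines go a long way. top_tokens is a made-up helper for this sketch, not part of any library:

```python
from collections import Counter

def top_tokens(text, n=3):
    """Lowercase, strip simple punctuation, and count word frequencies."""
    tokens = [t.strip(".,!?").lower() for t in text.split()]
    return Counter(tokens).most_common(n)

# Most frequent tokens, ties broken by first appearance.
print(top_tokens("the cat sat on the mat, and the cat slept"))
```

The same job in base R tends to involve more ceremony (strsplit returning lists, table, sort), which is the kind of friction being described.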
I'm not a Python or R dev (I'm actually a Scala dev whose day job consists substantially of refactoring Python and R code written by data scientists and data analysts to run on the JVM on prod servers). Sure, Python is easier to grok for upstream ETL/data processing, but that's commodity work (or it should be, anyway) and not a solid basis for comparing R vs. Python. R has far more packages than the "scientific" portion of PyPI, and for certain domains the quality (and quantity) of R's packages makes it the better choice: signal processing (or any more-than-routine time-series analysis), seismic interpretation, finance, experimental design, chemometrics, etc. And with strict use of the data.table package (coercing data frames to data.tables and using that package's syntax to manipulate your data), R is very fast. Ignore the smack and leave those folks to their "serious" work.
"No cell block output is ever truncated. Accidentally printing an entire 100,000+ row table to a Jupyter Notebook is a mistake you only make once." Hah, sadly this is not the case for me.
Jupyter now has output rate limitations (I pushed for them to add them), though I think they may be off by default. I also implemented something better (I think), which is an output buffer, for CoCalc's version of collaborative Jupyter notebooks. Instead of rate limiting, CoCalc saves only the last part of the output, discarding earlier output, and provides a link to get it.
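The "save only the last part of the output" idea described above can be sketched in a few lines of Python. TailBuffer here is a hypothetical illustration of the technique, not CoCalc's actual implementation:

```python
from collections import deque

class TailBuffer:
    """Keep only the last max_lines lines of output, counting what was dropped."""

    def __init__(self, max_lines=1000):
        self.lines = deque(maxlen=max_lines)  # old lines fall off the front
        self.dropped = 0

    def write(self, line):
        # deque with maxlen silently discards the oldest entry when full;
        # track how many were lost so we can tell the user.
        if len(self.lines) == self.lines.maxlen:
            self.dropped += 1
        self.lines.append(line)

    def render(self):
        header = []
        if self.dropped:
            header = [f"... {self.dropped} earlier lines discarded ..."]
        return "\n".join(header + list(self.lines))

# Simulate a cell that prints far more than we want to keep.
buf = TailBuffer(max_lines=3)
for i in range(10):
    buf.write(f"row {i}")
print(buf.render())
```

Compared to hard rate limiting, this keeps the tail of the output (often the part with the error or final result) and replaces the rest with a note, which matches the behavior described for CoCalc.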
One of the best things about R is the mailing list. At least when I was learning stats and R, the knowledge, both of how to do things in R and of what math to use when, was phenomenal. If gentle people didn't answer in time, Prof. Brian Ripley from Oxford would answer early morning British time and explain why your question was wrong, what the math meant, what you really meant to ask and why, and then give three lines of R to do it.
Can I say how damn good notebooks (any notebooks) are for data exposition compared to traditional coding environments?
I'm more familiar with Jupyter than R Notebooks. I'd second the point about version control in Jupyter being hard; there isn't really a good pattern for it yet.
I would note that I believe the latest version of Jupyter has prettier tables though!
Edit: Also, matplotlib makes me sad. Surely there could be something better which abandons it completely?
For statistical programming, since we're talking about R, I strongly recommend R for Data Science (http://r4ds.had.co.nz) by Hadley Wickham, who created a large share of the most commonly used R packages (the tidyverse) and, incidentally, now works for RStudio.
A good book on statistical theory is harder to come by, though.
Each field has their own "good practical statistics book". I work in finance and so recommend Fabozzi. It's good, but so are many other foundational texts. Your requirement for practicality necessarily negates a one true answer.
It will depend a lot on your field, but a solid grasp of fundamental probability theory should be applicable everywhere.
I think this is an excellent overview (http://www.math.uah.edu/stat/index.html). Learning probability from a measure-theory angle is more difficult to grok than the frequentist approach everyone is more familiar with, but I found it much more enjoyable. (I learnt the usual way during a computer science undergrad, but am now re-doing it more rigorously for a master's in financial engineering.)
What's your background, and what exactly do you mean by modern methods? An Introduction to Statistical Learning is good, and you can download the PDF: http://www-bcf.usc.edu/~gareth/ISL/ (it assumes a pretty decent background in mathematics, though).
Can someone explain in what sense these notebooks are "reproducible" to a greater extent than just a .py or R file? I'm not that familiar with them. Do they carry key metadata or something?
Writing a bunch of scripts can quickly become a mess. I was working on some twitter analysis for a project, and not really worrying about the code because I didn't intend for it to be used again, and it quickly became a mess of "run this script, then run that script on the generated file, then use this shell command to process the file, then run the final analysis step on that file, then clean up all the intermediates". Not to mention, say, "one-time" data cleanup through the shell / REPL that runs into problems months down the line when you want to update the data set. And, of course, invariably none of this is documented. Notebooks don't force you to organize your code and write documentation, but they strongly encourage it.
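One way to avoid the "run this script, then run that script on the generated file" mess described above is to record every step as a named function in a single file, so the whole chain is documented and re-runnable from scratch. All the names below are hypothetical stand-ins for the kind of steps involved:

```python
def fetch(raw):
    # Stand-in for the collection/download step.
    return [r.strip() for r in raw]

def clean(rows):
    # The "one-time" cleanup, now recorded as code
    # instead of living only in shell history.
    return [r.lower() for r in rows if r]

def analyze(rows):
    # Final analysis step; here just a row count.
    return {"n_rows": len(rows)}

def pipeline(raw):
    # The whole chain in one place, in order.
    return analyze(clean(fetch(raw)))

print(pipeline(["  Hello ", "", "World  "]))
```

Notebooks encourage exactly this shape implicitly, since cells run top to bottom and the intermediate state is visible; a plain script can get the same benefit, but only if you impose the structure yourself.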
The team working on Spyder (the closest Python alternative to RStudio) has had something like R Notebooks on their roadmap for a while now, but it keeps being pushed into the future.
I wish they could use RStudio for a while and understand just how important that feature is for someone using Python for research.
I have tried R in Jupyter a few times and it was nice, but the advantages of R Notebooks are just awesome. Git playing nice is the best advantage.
I am still clueless about the religious Python vs. R fight and the smack talk that "serious" work is done in Python. R works best for me.
The answer is always: math
You have other options, like Bokeh and Plotly.
Follow it up with The Elements of Statistical Learning, by three of the same authors, for more advanced material.