R, especially dplyr/tidyverse, is so underrated. Working in ML engineering, I see a lot of my coworkers suffering through pandas (or occasionally polars, or even base Python without dataframes) to do basic analytics or debugging. It takes eons and gets complex so quickly that only the most rudimentary checks get done. Anyone working in data-adjacent engineering would benefit from having R/dplyr in their toolkit.
Why not mix R and Python in interactive analysis workflows?
1) Download Positron: https://github.com/posit-dev/positron
2) Set up a Quarto (.qmd) notebook
3) Set up R and Python code chunks in your Quarto document
4a) Use reticulate to spawn a Python session inside R and exchange objects between both languages (https://github.com/posit-dev/positron/pull/4603)
4b) Alternatively, write a few helper functions that pass objects between R and Python by reading/writing a temporary file.
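For option 4b, the Python side of such helpers might look roughly like the sketch below. It assumes a CSV file as the exchange format and uses only the standard library; the function names `write_exchange_file` and `read_exchange_file` are made up for illustration. On the R side the matching calls would be something like `read.csv(path)` and `write.csv(df, path, row.names = FALSE)`.

```python
import csv
import os
import tempfile


def write_exchange_file(rows, path=None):
    """Write a list of dicts to a CSV file and return the file's path.

    If no path is given, a temporary .csv file is created.
    """
    if path is None:
        fd, path = tempfile.mkstemp(suffix=".csv")
        os.close(fd)
    fieldnames = list(rows[0].keys()) if rows else []
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    return path


def read_exchange_file(path):
    """Read a CSV file back into a list of dicts (all values come back as strings)."""
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


# Round trip on the Python side; R would read or write the same file in between.
rows = [{"id": 1, "score": 0.5}, {"id": 2, "score": 0.75}]
path = write_exchange_file(rows)
round_trip = read_exchange_file(path)
os.remove(path)
```

CSV loses type information (everything comes back as text here), so a typed format such as Parquet or Feather may be a better fit if both sides have a library for it.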
As someone who is learning probability and statistics for recreation, I wholeheartedly agree. I wish I had come across R and dplyr/tidyverse/ggplot2 back in college while learning probability and stats. They were quite boring and a drudgery to study because I wasn't aware of R as a way to play around with data.
I love R and dplyr. It is very readable and easy to explain to non-programmers. I use it almost every day.
Not exactly on topic, but I am having difficulty debugging it. Maybe I need to brush up on debugging R. Not sure if there is an easy way to add a breakpoint when using VS Code.
What’s the story for integrating R code into larger software systems (say, a SaaS product)?
I’m sure part of Python’s success is sheer mindshare momentum from being a common computing denominator, but I’d guess the integration story is part of the margins. Your back end may well already be in Python or have interop, reducing stack investment and systems tax.
Tangentially, R can help produce living Markdown documents (.Rmd files). A couple of ways include pandoc with knitr[0] or my FOSS text editor, KeenWrite[1]. I've kept the R syntax in KeenWrite compatible with knitr. Living documents as part of a build process can produce PDFs that are always up-to-date with respect to external data sources[2], which includes source code.
Last time I was working on something complex, I was able to knit from Rmd to md, and then use my usual pandoc defaults, which was quite neat. Big recommendation on that workflow.
I will say, after 15 years of messing with this: with LLMs I just do it all in Python. But I still miss the elegance and simplicity of R for data manipulation and analysis, especially the dplyr semantics. They really nailed it. I think they got crushed by the namespace / import system. There’s something about R that makes you so fluid and intuitive. But the engineering and the efficiency I get with Python now mean I can’t go back.
Funny you mention namespacing: R 4.5.0 was just released today with the new `use()` function, which allows you to import just what you need instead of clobbering your global namespace, equivalent to Python’s `from x import y` syntax.
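For anyone who hasn't used the Python form being compared to, a minimal sketch (the choice of the `statistics` module and the sample values are just for illustration): only the names you list get bound, and nothing else in your namespace, including built-ins like `filter`, gets shadowed.

```python
# Bind only the two names we need; nothing else from the module is imported.
from statistics import mean, median

values = [2, 4, 4, 9]
average = mean(values)
middle = median(values)

# The built-in filter is untouched, because the import did not shadow it.
evens = list(filter(lambda v: v % 2 == 0, values))
```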
I agree with all of your comment… except the very last bit. Do you really find Python to be more efficient at engineering stuff than R? And especially speed, which in my experience at least is broadly the same, if not faster with R, because it integrates more easily with Rust and C++?
Having seen Julia proposed as the nemesis of R (not python, that too political, non-lispy)
>the creator of the R programming language, Ross Ihaka, who provided benchmarks demonstrating that Lisp’s optional type declaration and machine-code compiler allow for code that is 380 times faster than R and 150 times faster than Python
(Would especially love an overview of the controversies in graphics/rendering)
In my opinion, Julia has the best alternative to dplyr in its DataFrames.jl package [1]. The syntax is slightly more verbose than dplyr because it's more explicit, but in exchange you get data transformations that you can leave for 6 months and, when you come back, read and understand very quickly. When I used R, if I hadn't commented a pipeline properly I would have to focus for a few minutes to understand it.
In terms of performance, DF.jl seems to outperform dplyr in benchmarks, but for day to day use I haven't noticed much difference since switching to Julia.
There are also APIs built on top of DF.jl, but I prefer using the functions directly. The most promising seems to be Tidier.jl [2] which is a recreation of the Tidyverse in Julia.
In Python, Pandas is still the leader, but its API is a mess. I think most data scientists haven't used R, and so they don't know what they're missing out on. There was the Redframes project [3] to give Pandas a dplyr-esque API which I liked, but it's not being actively developed. I hope Polars can keep making progress in replacing Pandas, but it's still not quite as good as dplyr or even DF.jl.
For plotting, Julia's time to first plot has got a lot better in recent versions, from memory it's something like 20 seconds a few years ago down to 3 seconds now. It'll never be as fast as matplotlib, but if you leave your terminal window open you only pay that price once.
I actually think the best thing to come out of Julia recently is AlgebraOfGraphics.jl [4]. To me it's genuinely the biggest improvement to plotting since ggplot which is a high bar. It takes the ggplot concept of layers applied with the + operator and turns it into an equation, where + adds a layer on top of another, and the * operator has the distributive property, so you can write an expression like data * (layer_1 + layer_2) to visualise the same data with two visualisations. It's very powerful, but because it re-uses concepts from maths that you're already familiar with, it doesn't take a lot of brain space compared to other packages I've used.
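To make the distributive idea concrete, here is a toy model of the algebra in Python. This is not the AlgebraOfGraphics.jl API, just a sketch of the concept: a spec is a sum of layers, `+` stacks layers, and `*` combines every layer on the left with every layer on the right. The names `Spec`, `term`, `scatter` and `smooth` are made up for illustration.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Spec:
    """A sum of layers; each layer is a tuple of component names."""

    layers: tuple

    def __add__(self, other):
        # + stacks the layers of both operands on top of each other.
        return Spec(self.layers + other.layers)

    def __mul__(self, other):
        # * merges every layer on the left with every layer on the right,
        # which is what makes it distribute over +.
        return Spec(tuple(l + r for l, r in product(self.layers, other.layers)))


def term(name):
    """Build a spec consisting of a single one-component layer."""
    return Spec(((name,),))


data = term("data")
scatter = term("scatter")
smooth = term("smooth")

combined = data * (scatter + smooth)
expanded = data * scatter + data * smooth
print(combined.layers)
```

Here `combined` and `expanded` compare equal: both describe two layers, each pairing the same data with one visualisation.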
The comment you linked is a response to my comment where I tried (and failed) to articulate the world in which R is situated. I finally "RTFA", and I think the benchmark perfectly demonstrates why conversations about R tend not to be very productive. The benchmark is of a hypothetical "sum" function. In R, if you pass a vector of numbers to the sum function, it will call a C function sum. That's it. In R, when you want to do lispy, tricky metaprogramming stuff, you do that in R; when you want stuff to go fast, you write C/C++/Rust extensions. These extensions are easy to write in a really performant way because R objects are often thinly wrapped contiguous arrays. I think in other programming language communities, the existence of library code written in another language is some kind of sign of failure. R programmers just do not see the world that way.
Julia is what I mostly use. I used R in the past, but I was constantly puzzled by the documentation. It did not work for me. Sometimes I fire up the REPL for some interpolation, but I limit myself to what I understand.
Totally agree. R is pure pirate energy. Half the functions are hidden on purpose, the other half only work if you chant the right incantation while facing the CRAN mirror at dawn.
Thanks! Paid books do note (above the link) that they're paid but I agree, a better visual might help. I'm thinking of removing the paid books where many free alternatives are available
One of my students codes exclusively in Python. But in most cases newer econometrics methods are implemented in R first, so he just uses rpy2 to call R from his Python code. It works great. For example, he recently performed Bayesian synthetic control using the R code shared by the authors. It required a Stan backend, but everything worked.
There is also https://www.rplumber.io/, which lets you turn R functions into REST APIs. Calling R from Python this way will not be as flexible as using rpy2, but it keeps R in its own process, which can be advantageous if you have certain concerns relating to threading or stability. Also, if you're running on Windows, rpy2 is not officially supported and can be hard to get working.
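From the Python side, calling such an API is just an HTTP request. A minimal sketch using only the standard library, assuming the R process is already serving the API locally and returns JSON; the host, port, `/mean` endpoint and `n` parameter are hypothetical placeholders, not part of any real API.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def build_url(base_url, endpoint, params=None):
    """Join base URL, endpoint and optional query parameters into one URL."""
    url = base_url.rstrip("/") + "/" + endpoint.lstrip("/")
    if params:
        url += "?" + urlencode(params)
    return url


def call_r_endpoint(base_url, endpoint, params=None, timeout=10):
    """Send a GET request to the R-backed API and decode the JSON response."""
    with urlopen(build_url(base_url, endpoint, params), timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Hypothetical endpoint and parameter; nothing is sent until call_r_endpoint is used.
example_url = build_url("http://127.0.0.1:8000", "/mean", {"n": 10})
```

With the R service running, `call_r_endpoint("http://127.0.0.1:8000", "/mean", {"n": 10})` would return whatever JSON the R function produces.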
Not sure what you mean by "python backend". If you mean calling R from Python, rpy2 mentioned in the other comment works well. If you mean the other direction, RStudio has this all built in. This is probably the best place to start: https://rstudio.github.io/reticulate/articles/calling_python...
I worked for 8 years with R's data.table package in research, and now that I've moved to the private sector I have to use Python and pandas. Pandas is so terrible compared to data.table it defies belief. Even tidyverse is better than pandas, which is saying something.
I miss it so much
I'm the curator of Big Book of R and am really happy to see it on the front page of HN :). New books are added every 6 weeks or so, and I send a notification of the new additions to my newsletter subscribers. The link is in the footer of every page.
[+] [-] cye131|11 months ago|reply
[+] [-] aquafox|11 months ago|reply
[+] [-] vishnugupta|11 months ago|reply
Well, better late than never I guess.
[+] [-] kasperset|11 months ago|reply
[+] [-] joshdavham|11 months ago|reply
[+] [-] wwweston|11 months ago|reply
[+] [-] fithisux|11 months ago|reply
[+] [-] thangalin|11 months ago|reply
[0]: https://yihui.org/knitr/
[1]: https://keenwrite.com/
[2]: https://youtu.be/XSbTF3E5p7Q?list=PLB-WIt1cZYLm1MMx2FBG9KWzP...
[+] [-] haberman|11 months ago|reply
[+] [-] juujian|11 months ago|reply
[+] [-] uptownfunk|11 months ago|reply
[+] [-] tylermw|11 months ago|reply
e.g. to avoid dplyr masking stats::filter:
use("dplyr", c("mutate", "summarize"))
[+] [-] dkga|11 months ago|reply
[+] [-] gsf_emergency_2|11 months ago|reply
https://news.ycombinator.com/item?id=42785785
[+] [-] Hasnep|11 months ago|reply
[1]: https://dataframes.juliadata.org/
[2]: https://github.com/TidierOrg/Tidier.jl
[3]: https://github.com/maxhumber/redframes
[4]: https://aog.makie.org/
[+] [-] CreRecombinase|11 months ago|reply
[+] [-] fithisux|11 months ago|reply
BTW I am a senior Java / Python developer
[+] [-] barrenko|11 months ago|reply
[+] [-] vharuck|11 months ago|reply
https://www.burns-stat.com/pages/Tutor/R_inferno.pdf
[+] [-] fn-mote|11 months ago|reply
The invention of the Tidyverse freed new R programmers from 126 pages of gotchas.
Tell them to learn to use the tidyverse instead. For most of them, that will be all they ever need.
[+] [-] wpollock|11 months ago|reply
[+] [-] DadBase|11 months ago|reply
[+] [-] hcarvalhoalves|11 months ago|reply
https://bookdown.org/ndphillips/YaRrr/
[+] [-] madcaptenor|11 months ago|reply
[+] [-] oscarbaruffa|11 months ago|reply
[+] [-] madcaptenor|11 months ago|reply
One comment: it would be good to distinguish between books that are free and books that you have to pay for.
[+] [-] oscarbaruffa|11 months ago|reply
[+] [-] kingkongjaffa|11 months ago|reply
I’ve been tempted to port to Python, but some of the stats libraries have no good counterparts, so is there an ergonomic way to do this?
[+] [-] malshe|11 months ago|reply
[+] [-] jjr8|11 months ago|reply
[+] [-] bachmeier|11 months ago|reply
[+] [-] jmalicki|11 months ago|reply
[+] [-] huijzer|11 months ago|reply
[+] [-] ebri|11 months ago|reply
[+] [-] fhsm|11 months ago|reply
[+] [-] hughess|11 months ago|reply
R and RMarkdown were big inspirations for what we're building at evidence.dev now, so very grateful to everyone involved in the R community
[+] [-] loa_observer|11 months ago|reply
repo: https://github.com/Kanaries/GWalkR site: https://kanaries.net/gwalkr
[+] [-] dikip|11 months ago|reply
[deleted]
[+] [-] LostMyLogin|11 months ago|reply
[+] [-] oscarbaruffa|11 months ago|reply
[+] [-] marginatum|11 months ago|reply
[+] [-] Annatar|11 months ago|reply
[deleted]
[+] [-] qomioo|11 months ago|reply
[deleted]
[+] [-] brcmthrowaway|11 months ago|reply
[+] [-] hadley|11 months ago|reply
[+] [-] kgwgk|11 months ago|reply
[+] [-] countrymile|11 months ago|reply