
Groundhog: Addressing the Threat That R Poses to Reproducible Research

161 points | snakeboy | 5 years ago | datacolada.org

120 comments

[+] st1x7|5 years ago|reply
Either I'm misunderstanding or this is a non-problem. You can specify older versions of a package when you install it. You can also manage them with packrat. As long as researchers share their language and package versions, you can fully reproduce their environment. (And the base language is really stable, almost to a fault.)

This is just a bad way for the author to promote their own library for dealing with this. The way their library seems to approach this (using dates instead of versions) seems horrible too - on any given date I can have a random selection of packages in my environment, some of them up-to-date, some of them not. So unless all researchers start using the author's library (and update to the latest versions of everything just before they publish), it's only making things worse and not really solving the problem it claims to solve.
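For reference, pinning an older package version at install time can be sketched like this (the comment doesn't name a tool; `remotes::install_version()` is one common option, and the package/version shown are illustrative):

```r
# Install a specific historical version from the CRAN archive
install.packages("remotes")
remotes::install_version("ggplot2", version = "3.2.1")

# Then record what you actually used, so readers can reproduce it
packageVersion("ggplot2")
```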

[+] kumarsw|5 years ago|reply
The impression I get is that this tool has a forensic bent to it. You ask for the code for a paper and Joe grad student with no programming knowledge emails you a zipped folder of R scripts. He just finished his dissertation, is starting a new job somewhere on the west coast, and no longer has access to the computer in his old advisor's lab where he did the work. The implementation may (or may not) be lousy, but the use case sounds plenty valid.
[+] dash2|5 years ago|reply
Yeah, this seems half-thought-through. renv works at project level and isolates the dependencies of a project from your main library. groundhog.library() tramples over your library installing multiple versions. It also has the "cute" feature of auto-installing libraries if they aren't on your system already. Yuck. If you really wanted this script-only solution then you could go with the `versions` library, which already lets you specify an installation date.[1]

[1]: https://cran.r-project.org/web/packages/versions/index.html
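A minimal sketch of the date-based install the `versions` package offers (package name and date here are illustrative):

```r
# versions::install.dates() installs the versions that were current
# on CRAN at the given date (it pulls from the MRAN snapshot archive)
install.packages("versions")
versions::install.dates("dplyr", "2020-06-01")
```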

[+] meztez|5 years ago|reply
Fully agreed. Add `sessionInfo()` output as an appendix to your publication. Should not be too hard to rebuild from that.
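A sketch of capturing that appendix (the file name is illustrative):

```r
# Dump the R version, platform, and attached/loaded package versions
# to a text file to include with the publication
writeLines(capture.output(sessionInfo()), "session-info.txt")
```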
[+] _Wintermute|5 years ago|reply
Correct me if I'm wrong, but specifying an older package version in R still pulls the newest packages from CRAN for any dependencies, which is a quick way to run into a load of incompatibilities.

I've not tried renv yet but packrat was a pretty poor solution.

[+] kwertzzz|5 years ago|reply
I fully agree. With version numbers you can get a good sense of whether a package update will break your code or not (as long as the package authors follow semantic versioning).

I think in Julia this problem is solved quite nicely with the Project.toml (the list of packages you directly depend on) and the Manifest.toml file (the version numbers of the complete dependency tree, which is generated automatically).

It seems that in groundhog you declare only direct dependencies. Is there a way to store the full dependency tree in R?
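Roughly what that pair of files looks like (package names, UUIDs, and versions here are purely illustrative):

```toml
# Project.toml -- only the direct dependencies you asked for
[deps]
SomeDirectDep = "00000000-1111-2222-3333-444444444444"

# Manifest.toml (a separate, auto-generated file) -- the full resolved tree
[[SomeDirectDep]]
uuid = "00000000-1111-2222-3333-444444444444"
version = "1.2.3"

[[SomeTransitiveDep]]
uuid = "55555555-6666-7777-8888-999999999999"
version = "0.4.7"
```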

[+] mslip|5 years ago|reply
Not really relevant, but just to note that packrat has been soft-deprecated and is superseded by renv, which comes standard with RStudio.
[+] ekianjo|5 years ago|reply
Packrat is deprecated now. It is recommended to use renv instead.
[+] roel_v|5 years ago|reply
Apart from all the other considerations and problems with various types of package management, consider this:

"Update January 6th, 2021 A reader alerted me to a bug with the current groundhog (version 1.1.0) where you cannot set the groundhog library to be a folder containing spaces in the name."

So we are talking about software here that somehow made it to version 1.1 *without anyone ever using it with a directory that has spaces in its name*. This can be interpreted in two ways: either very few people have spaces in their paths, or very few people have actually ever even tried (not even really used, I'm only talking about the most basic trial use) this package. I'm not a betting man, but if I were, I know where I'd put my money...

[+] bayindirh|5 years ago|reply
As I can see from the researchers in our cluster and my own academic research, most people still avoid spaces in paths and files like the plague.

YMMV of course.

[+] kristaps|5 years ago|reply
Don't remember the source and probably misquoting, but I like this truism: there's software that people complain about and software that nobody is using.
[+] jcelerier|5 years ago|reply
> This can be interpreted in two ways: either very few people have spaces in their paths

It's been years since I've seen anyone do that. A main reason is that a very widely used dev tool, make, does not handle spaces in paths:

http://savannah.gnu.org/bugs/?712

thus leading to inertia in the whole ecosystem: if make does not support spaces in paths, why bother?

[+] IshKebab|5 years ago|reply
> So we are talking about software here that somehow made it to version 1.1 without anyone ever using a directory with spaces in it with it.

This is extremely common, especially on Linux. Basically anything that uses things like Bash or CMake will almost certainly not work in directories containing spaces.

Developers don't use paths containing spaces because it causes so many issues with badly written Bash scripts, and as a result they don't test their code with paths containing spaces.

Bash and CMake and similar hacked together languages have very error-prone quoting rules that make it very easy to accidentally make something work with paths without spaces but fail on paths with spaces.
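A minimal illustration of the quoting pitfall (the path here is made up):

```shell
mkdir -p "/tmp/my project"
path="/tmp/my project"

# Unquoted, $path word-splits into two arguments ("/tmp/my" and
# "project"), so this would fail or do the wrong thing:
#   ls -d $path

# Quoted, it is passed as a single argument and works:
ls -d "$path"   # prints /tmp/my project
```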

[+] CJefferson|5 years ago|reply
If you start discarding software which has problems with a space in a directory name, you should start with libtool, at which point you can't build significant chunks of the Linux ecosystem.

https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=193163

I hit this when trying to test libgmp (as an example of an important library you would lose).

This means in practice you can't really build most software which uses configure scripts and libraries in a directory with a space -- this may well be what they are hitting.

[+] mattmanser|5 years ago|reply
It doesn't even seem to be on GitHub, in fact the source doesn't seem to be listed anywhere on the project website.

Which in our world would scream 'complete amateur, avoid, avoid, avoid', but perhaps it's different in the R world.

[+] tpxl|5 years ago|reply
Could also be that the package manager doesn't use spaces and most people use package managers?

I.e. Maven will create a folder structure like "/home/user/.m2/repository/com/example/example.jar", which will never have spaces unless the username has spaces (can Linux usernames have spaces?).

[+] GlennS|5 years ago|reply
I know of two other existing solutions to this, although I don't know enough to compare. I don't think either of these tick all the author's boxes.

Microsoft MRAN https://mran.microsoft.com/

> For the purpose of reproducibility, MRAN hosts daily snapshots of the CRAN R packages and R releases as far back as Sept. 17, 2014.

MRAN doesn't seem to be very well known or used in the R community, and I don't really know why.

Separately, Nix https://nixos.org/ also solves this problem for lots of different languages, but is difficult to get started with and still a bit rough around the edges. Probably not a good recommendation for a typical analyst or academic at this point.

[+] chalst|5 years ago|reply
The article discusses MRAN in footnote 5, when arguing against the MRAN-based 'checkpoint' approach.

Nixpkgs/NixOS is obviously a useful technology for reproducibility, but note that the output of Nix scripts can depend on the time the system was built, the contents of URLs, and the system architecture unless care is taken.

[+] warlog|5 years ago|reply
It looks like this is much more fine-grained compared to MRAN, i.e., with groundhog you select the date, whereas with MRAN you use the last (often more than a year old) snapshot.

MRAN is a great idea, and if RStudio (the de facto gatekeepers of the faith, with Hadley as the high priest) pushed to use MRAN, then the R community would follow suit (like they do for everything else).

This would do a lot to bring MS into the fold, which would actually be great for R.

[+] _Wintermute|5 years ago|reply
MRAN has saved my bacon more than once when I need to replicate some R environment written years ago. The package management in R really is terrible.
[+] asperous|5 years ago|reply
Reproducible? Or deterministic?

There are certainly benefits to being able to pull down research source code and bug-check it. That's how programmers check code: tests and audits.

However, I think reproducing research is more often than not done "from scratch": taking a new sample, treating it, checking the results. "Independent verification".

Re-using source code saves time, but I would argue that not being able to shouldn't threaten reproducibility.

[+] cauthon|5 years ago|reply
This title is an exceedingly hot take for someone who wrote a new package manager.

Also, it appears that Groundhog is itself a CRAN package and the author recommends installing with install.packages(). So is the author committing to never making any backwards incompatible updates to their new package?

[+] bsza|5 years ago|reply
I think it's more like a Wayback Machine for R programs, since the author of a science paper isn't required to use groundhog. You can just provide it the date the article was published, which you already know, and it reconstructs how the program worked on that day.

Also, because groundhog isn't made for the author to use, whether or not the interface changes is irrelevant. You'll never encounter library(groundhog) in a paper.

[+] jsmith99|5 years ago|reply
That's the problem. This package is very similar to Microsoft's checkpoint package which is based on Microsoft's MRAN snapshots, and this package also uses MRAN. The article explains the difference is that this package allows you to specify the date in the code itself, whereas checkpoint is used to set a whole installation to a specific date. But this is no advantage as it means code will stop working if the groundhog package changes, whereas with checkpoint a paper could just say 'use packages as of date x'.
[+] coolreader18|5 years ago|reply
> So is the author committing to never making any backwards incompatible updates to their new package?

Well, yes, probably. It's not all that hard, and groundhog seems to have a fairly simple API anyways.

And groundhog still uses CRAN packages, it just brings a method of pinning them to a specific version.

[+] resonantjacket5|5 years ago|reply
Your take seems a bit 'hot' too?

How else would you install the CRAN packages without using install.packages? Unless you want them to recursively install it using groundhog, but that seems unnecessary.

As long as you have the timestamp it should work, though I assume there will be some edge case.

What you're saying is like saying don't use pip because you can't install pip using pip? Or don't use package-lock.json because you can't install npm through npm?

[+] mespe|5 years ago|reply
While I agree with many of the negative comments here about issues with how this is implemented, the tone of some comments is... not great. To the point that I would be reluctant to share work I do in R on Hacker News, which is not helping anyone.

Just a reminder: https://news.ycombinator.com/newsguidelines.html

[+] anderscarling|5 years ago|reply
I’ve yet to use it personally, but renv [1] seems to try to solve the reproducible builds problem in a way more similar to other modern package managers (e.g. by generating a lockfile).

This approach enables stricter validations against tampering with the package repositories as a hash of the package can be stored in the lockfile, however it is obviously a bit more complex to use than the groundhog approach.

[1]: https://github.com/rstudio/renv
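The renv workflow, roughly (these calls are from renv's documented API; the project contents are left abstract):

```r
# In a fresh project: create an isolated, per-project library
renv::init()

# ...install packages, write code...

# Record the exact versions (and hashes) of everything in renv.lock
renv::snapshot()

# Later, on another machine: reinstall exactly what the lockfile records
renv::restore()
```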

[+] notafraudster|5 years ago|reply
Agreed that renv is a better solution here. Even the example code for groundhog is not written in idiomatic R, which does not inspire confidence. Simonsohn is a legend in transparent research but not primarily a coder or software-tool contributor (take a look at the source for p-curve if you want to see what I mean), and I think a secondary threat to reproducibility is relying on tools that end up abandoned or deprecated, or for which bugs never get fixed.
[+] prepend|5 years ago|reply
I came here to say this.

This seems like a non-issue given renv. And renv gives, I think, a more reproducible solution, as it pins to versions, not dates.

[+] threeseed|5 years ago|reply
Nothing about this is specific to R.

If you want to guarantee reproducible results you have to use a container/image with libraries added at build time. Anytime you are relying on floating versions or downloaded libraries you will have issues.
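A sketch of that approach, pinning both the R version and the package versions at image build time (the image tag, package versions, and script name are illustrative):

```dockerfile
# Pin the R version via a versioned base image
FROM rocker/r-ver:4.0.3

# Pin package versions at build time so every rebuild gets the same tree
RUN R -e "install.packages('remotes'); \
          remotes::install_version('dplyr', version = '1.0.2')"

COPY analysis.R /analysis.R
CMD ["Rscript", "/analysis.R"]
```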

[+] tempay|5 years ago|reply
Even this isn’t enough to be reproducible for complex numeric code as switching CPU can make a big difference with small differences being amplified. Hopefully none of those cases matter but it’s hard to definitively prove that.
[+] jdc|5 years ago|reply
Yeah or even just vendorize your dependencies.
[+] vhhn|5 years ago|reply
There are two camps in the R world: tidyverse and base R ("tinyverse").

It's not a coincidence that the author gives an example from the tidyverse ecosystem. Authors and users of the tidyverse value other things, like consistency and new features, over API stability and backward compatibility. The base-R ecosystem is actually very stable, and so the original package manager is very simple.

With R spreading out from the academic environment, and with many new authors breaking their packages' APIs, we are seeing new attempts to solve the issues with dependencies (such as renv or https://rsuite.io)

[+] dracodoc|5 years ago|reply
Title aside, the proposed solution just:

- uses Microsoft MRAN, which did the heavy lifting of hosting archives

- uses a date instead of a version

- installs packages automatically the first time (which pacman::p_load has been doing for ages) and is easier to use at the script level.

It's no coincidence that most package-manager solutions use versions instead of dates to control the environment:

- A paper published in 2017 may use a date like 2017-10-01, but there is a high possibility that some of the dependency packages are of an earlier date, unless the author updates packages every day or week, which is not a good habit anyway because updating too frequently will break things more frequently.

- Then how can you reproduce the environment using a date? The underlying assumption that all packages will be the latest as of that date simply doesn't hold.

That's why packrat/renv etc. use a lock file to record all package versions, and why you need a project to manage libraries: you need to maintain different library environments and cannot install everything to the same location.

Yet the author takes installing all packages to a single location as a feature, since you don't need to install the same package again, and tries to avoid projects, preferring scripts as much as possible, when doing reproducible research?

[+] wodenokoto|5 years ago|reply
Wow, that's a lot of pessimism for a fairly elegant solution to the fact that almost no R code has package versioning defined.

I think the major sales point here is:

> A nice feature of groundhog is that it makes 'retrofitting' existing code quite easy. If you come across a script that no longer works, you can change its library() statements for groundhog.library() ones, using as the groundhog.day the date the code was probably written (say when it was posted on the internet), and it may work again.

I don't know how good packrat is nowadays. I've never met an R application that uses it, but at my old work we would take a dated snapshot of CRAN at the beginning of every new project. If we needed to update a package, we could then "update CRAN" for that project. When productionising a project, it would be frozen to a date in CRAN.
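The retrofit the quoted passage describes amounts to a one-line change per package (the package and date here are illustrative):

```r
# Before: loads whatever version happens to be installed
# library(dplyr)

# After: loads the version that was current on CRAN at the given date
library(groundhog)
groundhog.library("dplyr", "2021-01-04")
```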

[+] timsneath|5 years ago|reply
I'm curious whether this actually solves the problem. I understand how this assists with reproducibility of packages, but the R software itself is updated frequently, as is briefly noted in the preamble to this document. Indeed, the release notes [0] are fairly transparent about the relatively long list of changes.

Given this, it almost seems more dangerous to imply through this package that a particular date's results are reproducible, since unless the user has the same version of R, they may see different results anyway.

[0]: https://stat.ethz.ch/pipermail/r-announce/2020/000653.html

[+] ppod|5 years ago|reply
Very clickbaity headline. The problem described is real, but it's just as real or worse in other statistical software, so it's not R as a whole that poses a threat to reproducibility.
[+] qrohlf|5 years ago|reply
I’ve had some brief run-ins with R, and it doesn’t surprise me that it doesn’t have a versioning story for packages, and that the patched-in system described here is based on dates rather than something like a SHA or version number...

My favorite description of the language comes from http://arrgh.tim-smith.us/:

> R is a shockingly dreadful language for an exceptionally useful data analysis environment.

I feel like this is just one more data point to support that statement.

[+] stewbrew|5 years ago|reply
I wish the (default) utils::install.packages function could take a version number for the requested library. I also wish library() would automatically install libraries not available on the system. (Both can be achieved with custom functions that shadow the default ones, but I would like to see this functionality in the base packages.) Other than that, I think all alternatives to this "threat called R" are worse. It's telling that the author has to cite a bug from 2016 as an example of a breaking change.
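The shadowing idea the comment mentions can be sketched as a small wrapper (this is a hypothetical helper, not part of base R):

```r
# Auto-installing replacement for library(): install from CRAN if the
# package is missing, then attach it as usual
library2 <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
  library(pkg, character.only = TRUE)
}

library2("jsonlite")  # installs on first use, then attaches
```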
[+] SoSoRoCoCo|5 years ago|reply
> The problem is that packages are constantly being updated, and sometimes those updates are not backwards compatible.

Uh oh, someone just discovered the modern programming landscape!

Python, Node, R, Rust, and other langs/OSes with package managers are at the mercy of volunteers who keep important packages healthy. Once issues stop being fixed, y'all better have local copies. This used to be predominantly an OS issue, now it is a language issue, too.

[+] samch93|5 years ago|reply
Can recommend the paper "A Reproducible Data Analysis Workflow with R Markdown, Git, Make, and Docker" by Peikert and Brandmaier [1], which shows a much more robust approach to reproducibility.

[1] https://psyarxiv.com/8xzqy/

[+] hermitcrab|5 years ago|reply
Being able to assemble a solution from parts (as in R packages) is super flexible. But complex and potentially brittle.

Reproducibility is a big problem all around. When I create releases, I put the binaries as well as the source in version control, because changes in tools/libraries etc. mean that I probably won't be able to create the exact same binary several years later from the same source.

There is always a tradeoff between flexibility and simplicity. Clearly software needs to be able to change, or you are never going to be able to improve it or fix bugs. And an assembly of constantly changing parts is clearly going to come with its own challenges.

My own software product, Easy Data Transform (which competes with R to some extent) trades off some flexibility for simplicity by having a single set of binaries for each platform. You can't add any components (without hacking). So the same version of software should always give the same result.