bravura|13 years ago
I've commented on this topic before (open notebook science).
Academic code should be released. In fact, I believe that you should host everything on a public GitHub/Bitbucket repository from day 1.
I don't like the proposal in this blog post, i.e. that we insist that code must work in a distributed fashion under some consortium guidelines. Increasing the friction to releasing code is the wrong approach.
I believe open notebook science should be incentivized by aligning it with career goals. Publications and conferences should have separate tracks to which you cannot submit unless you release your code. Academics who want to release their code then have an edge, because those tracks are less competitive.
The notion that you must release polished code is mistaken. Pushing everything to GitHub removes the friction in sharing code, and this has worked well in the past.
The author complains that neither he nor his students have the time to polish code or write documentation. Here's my solution: I've released research code under a "pay it forward" support license. If you ask me for support, and I give you support over email, please document what you learned through your investigation and discussion with me, and then push your changes to me.
etal|13 years ago
We should be clear about which of two kinds of scientific code we're talking about:
1. A program that implements a new technique which forms an important part of a research project. Maybe a program that is the research project, which will be described in a paper.
No doubt this code should be included with the publication, no matter how "ugly" it is. Some journals, e.g. Bioinformatics, already require that an article about software must include the software itself. This is the stuff the Bioinformatics Testing Consortium would run a smoke test on. Amazingly, a lot of programs that have been written up as journal articles just don't compile or run at all on somebody else's machine; many articles don't include the source code, and some don't even say how to get a redistributable binary. That's wrong, and we can fix it.
2. The mountain of single-use scripts and shell commands that are used in a research project that's not really about software at all, only a small fraction of which produce some output that the scientist follows up on.
Key points: (1) this code is very unlikely to work on anyone else's machine as-is; (2) crucial parts of these pipelines are lost in the Bash history, or were executed on a 3rd-party web server, or depend on a data set on loan from a collaborator who is not ready to release the data yet; (3) almost all of the code is dead; (4) whatever comments or notes exist are usually misleading or completely wrong.
As an example of what can go wrong when this code is released as-is, remember when the East Anglia Climatic Research Unit's "hide the decline" story hit the fan? It wasn't clear which code was dead, the comments made no sense, and people panicked because they couldn't tell how the published results came out of that godawful mess. The eventual solution, far too late, was to turn the important bits into a proper open-source, openly developed software project. That, in a nutshell, is why scientists won't release ALL the code: even the hard drive itself is not the whole story; the scientist still needs to be available to explain it and steer readers past the red herrings. And getting code into a state where it's self-explanatory takes time.
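A smoke test of the kind described for category 1 doesn't need to check correctness, only that the thing builds and runs at all on another machine. A minimal sketch in Python (the build and run commands at the bottom are hypothetical placeholders):

```python
import subprocess

def smoke_test(build_cmd, run_cmd):
    """Return True if the tool builds and then runs without crashing.

    This checks nothing about correctness: only that the code compiles
    and executes at all on this machine, which is the bar many published
    programs reportedly fail to clear.
    """
    for label, cmd in (("build", build_cmd), ("run", run_cmd)):
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode != 0:
            print(f"smoke test failed at {label} step: {result.stderr.strip()}")
            return False
    return True

# Hypothetical usage for a tool shipped with a paper:
# smoke_test(["make", "-C", "mytool"], ["./mytool/mytool", "--help"])
```

Anything this shallow that still fails is exactly the kind of paper-with-broken-software the consortium idea is meant to catch.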
freyrs3|13 years ago
The barrier is convincing researchers that their code doesn't have to be beautiful to open-source it; it just has to be checkable by someone else with the same domain knowledge.
Fortunately, a lot of the scientific Python community seems to be getting behind the idea of building IPython Notebooks that document scientific workflow. That seems to be a step in the right direction.
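In the same spirit, even outside a Notebook, a few lines can keep a pipeline from vanishing into the Bash history. A sketch of the idea (the step commands in the usage comment are made up): every step runs through one function that records what was executed, when, and whether it succeeded:

```python
import datetime
import subprocess

LOG_FILE = "pipeline_log.txt"

def step(command):
    """Run one shell step of an analysis pipeline and log it permanently.

    Unlike the Bash history, the log survives, travels with the project,
    and records the exit status of every command.
    """
    result = subprocess.run(command, shell=True)
    stamp = datetime.datetime.now().isoformat(timespec="seconds")
    with open(LOG_FILE, "a") as log:
        log.write(f"{stamp}\t{command}\texit={result.returncode}\n")
    result.check_returncode()  # halt the pipeline as soon as a step fails

# Hypothetical pipeline:
# step("sort -k2 raw_reads.tsv > sorted_reads.tsv")
# step("python filter_low_quality.py sorted_reads.tsv filtered.tsv")
```

Committing the log file alongside the scripts addresses point (2) above: the crucial glue commands are no longer lost.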
bo1024|13 years ago
Some of these are simply excuses. Publishing your code doesn't mean you have to support or maintain it. Just make the code available and be done with it. The important thing is that it's out there.
tikhonj|13 years ago
To me, it seems that many of the problems with producing reasonable code could be remedied by writing better code the first time. In my experience, writing somewhat less hacky code is usually not much slower than writing completely hacky code. Moreover, it often saves time: on many occasions I have spent far longer chasing a stupid bug than I would have spent writing cleaner code in the first place.
Now, learning to do this well does take some time. However, if you're going to be writing a fair bit of code, I think this time will more than pay for itself. The amount of effort you can save down the road by spending a little bit more time at the very beginning is not negligible.
So: I think researchers should try writing cleaner code (rather than cleaning up poorly written code at the end). I would not be surprised in the least if, after they've gotten used to it, they're actually more efficient than before!
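As an illustration of the trade-off (the file name and field counts are invented): the "completely hacky" version in the first comment silently produces garbage if the input format drifts, while a slightly cleaner version costs a few extra lines and fails loudly at the exact line that broke:

```python
# Completely hacky: crashes (or worse, silently misreads) on any malformed row.
# mean = sum(float(line.split()[1]) for line in open("data.txt")) / 100

def column_mean(path, column, expected_fields):
    """Mean of one whitespace-separated column, failing loudly on bad rows."""
    values = []
    with open(path) as f:
        for line_number, line in enumerate(f, start=1):
            fields = line.split()
            if len(fields) != expected_fields:
                raise ValueError(
                    f"{path}:{line_number}: expected {expected_fields} "
                    f"fields, got {len(fields)}"
                )
            values.append(float(fields[column]))
    if not values:
        raise ValueError(f"{path} contains no data")
    return sum(values) / len(values)
```

The cleaner version is also the one you could publish without embarrassment, which loops back to the "checkable, not beautiful" standard above.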
randlet|13 years ago
There's an aspect this article neglects to mention: rightly or wrongly, some scientists believe that keeping your source code private is a competitive advantage over other researchers.
Unfortunately, science is an extremely competitive business and providing your competitors with tools to help them publish more papers faster doesn't always make good business sense.