top | item 5611702

Practices in source code sharing in astrophysics

73 points | ngoldbaum | 13 years ago | arxiv.org | reply

63 comments

[+] Blahah|13 years ago|reply
I'm a computational biologist. If I read a paper where the results of an analysis are presented, but the code used is not, that analysis is worthless. Thankfully, in computational biology & bioinformatics at least, open source is widespread and growing.

Every programmer knows that every programmer makes mistakes. Lots of them. Mistakes that can completely change the outcome of an analysis. And if the results of your analysis are to be taken seriously, people have to be able to trust them, which means they have to be able to check your working.

Not publishing your code is anti-science, selfish, and IMHO should be disallowed in the literature.

[+] dalke|13 years ago|reply
"Not publishing your code is anti-science" ... "open source is widespread and growing"

I've noticed that few people realize that "publishing your code" and "open source" are overlapping issues, but they are not the same, and the needs of one are sometimes opposed to the needs of the other.

Open source necessarily implies that the people who receive the software have the right to modify the software, redistribute it with or without changes, and be able to do so for a charge.

No one has been able to tell me why software for scientific publications requires the third of those abilities. Everyone is agreed that access to the code and the ability to modify it makes it much easier to review and understand what it does. I can also understand the need to share modifications with collaborators, in order to help carry out the analysis. But I don't see why published software can't have a prohibition on charging a fee for using modified versions of the software, or services rendered which use the software.

My first question for you, then, is: do you need the ability to commercialize someone else's software in order to provide good scientific review of their publication? More specifically, what sorts of review would that prohibition eliminate?

In the other direction, I have scientific software which I sell for about $30K. (This is not hypothetical - I really do have this). Customers get the source code under the BSD license. This falls into every standard definition of "free software" and "open source" software. There's even an essay (at http://www.gnu.org/philosophy/selling.html ) encouraging people to sell free software.

If I publish a paper which uses the software, then am I obligated to give my peers access to the source code for no fee? Or can I publish the paper, say it's available under a BSD license, and charge $30K for access to it?

So my second question is, are there limits on what I can charge people in order to get access to my open source software, which was used for a paper? If so, what are they, and what is the ethical basis for that judgment? (For example, should it be "fair, reasonable and non-discriminatory"? Can it cover development costs? Distribution costs? Web site development costs?)

[+] new299|13 years ago|reply
If the paper is based on the analysis of sequence data, it's almost certainly been processed by closed or "shared source" primary data analysis tools. That's true of Illumina/Roche/ABI. Biotech vendors have quite a long way to go in this regard.
[+] anon_barcode|13 years ago|reply
In astronomy we are still fighting for public data. Verification of results comes from another researcher collecting his own data and running his own analysis.

Out of curiosity, do you work for industry or academia?

[+] gituliar|13 years ago|reply
I'm a particle physics theorist. In our field there are a lot of open-source projects (see http://www.hepforge.org/projects), as well as some private ones. I believe that source code should be published/shared in tandem with the results it produces. Otherwise, those results have no scientific value and I personally tend not to trust them, even when they agree with well-known predictions. What matters is the algorithm the results are obtained with, not the results themselves. In other words, scientists should form a hacker-like community and not fall into the trap of a developer-user relationship, where some develop software while others merely use it.

I actually wanted to ask your opinion on the migration of open-source to private software. The most common open-source and free software licenses, like MIT or GNU, allow this kind of migration. That is, Bob could modify Alice's public project, produce results, and publish them without sharing his sources with anybody. That seems unfair, since Alice did a great job and would most likely like to see the changes. How can Alice be protected from such situations? Is there any kind of license that forces modified software to be shared when results it produces are published or made available to the general public?

[+] tripzilch|13 years ago|reply
Isn't that what GPL does?
[+] ngoldbaum|13 years ago|reply
As an astronomer, I hope this idea takes off. Unfortunately, judging by the reactions my colleagues have had when I've brought up this idea in the past, I don't think it's very likely.
[+] A_Allen|13 years ago|reply
I hope the idea takes off, too.

Providing codes is imperative; they are part of the methods and should be available for examination to ensure the integrity of the science. Those conducting research funded with public monies should be required by the funding agencies to release the products of that research: not just the results, but the data and codes, too (absent truly compelling reasons, such as national security). Eventually, I expect funding agencies will indeed require this for astronomy, just as they do for some other sciences; journals could help the field along, and improve the transparency and reproducibility of research, by requiring code release upon publication.

Absent funding agencies and journals insisting on code release and the moral argument of reproducibility, what incentives would help convince code authors to release their software?

[+] jared314|13 years ago|reply
Is it a code quality or a code portability issue?
[+] lutusp|13 years ago|reply
> While software and algorithms have become increasingly important in astronomy, the majority of authors who publish computational astronomy research do not share the source code they develop, making it difficult to replicate and reuse the work.

This is troubling. There's no field involving computation in which withholding source code is routinely accepted. The four-color map theorem proof (Appel & Haken, 1976) would never have been accepted without source code. Modern mathematics, to the degree to which it relies on computer results, also relies on the publication of source.

Another example is the recent revelation involving an error in an Excel spreadsheet and its effect on an economic analysis -- the correction wouldn't have been possible without publication of the spreadsheet alongside the conclusions drawn from it.
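The spreadsheet incident referred to above hinged on a formula whose range silently excluded some rows. A toy sketch of how that kind of error shifts a result (the row labels and numbers here are made up for illustration, not the actual data from the economics paper):

```python
# Illustrative only: a toy version of a spreadsheet range error.
# Five rows of (hypothetical) growth figures:
rows = {"A": 2.1, "B": 1.4, "C": -0.3, "D": 3.0, "E": 0.8}
values = list(rows.values())

# Intended formula: AVERAGE over all five rows.
correct_mean = sum(values) / len(values)

# Buggy formula: the range stops two rows short, like AVERAGE(B1:B3)
# where AVERAGE(B1:B5) was intended.
buggy_mean = sum(values[:3]) / 3

# The two "results" disagree, and nothing in the output flags the error;
# only inspecting the formula (i.e. the published spreadsheet) reveals it.
print(correct_mean, buggy_mean)
```

The point stands independently of the specific numbers: the error is invisible in the reported figure and only detectable in the released artifact.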

Also, replication is a cornerstone of serious science. Without replication, astrophysics becomes psychology, where replication is rare.

I hope this paper has the effect of correcting this systematic flaw in astrophysics publication.

[+] anon_barcode|13 years ago|reply
As an astronomer I hope this does not take off.

The threat to job security is not just "perceived". Astronomers are frequently kept in a state of constant job-induced anxiety by the prevalent practice of fighting for six-month to two-year "postdocs" for the first 15-20 years of their careers (seriously, go to an astronomical conference; you would not believe the number of people going grey under 35). To "keep" your job (in reality, get another postdoc) you must do two things: author papers and get papers cited.

This practice makes collaboration actually detrimental to a researcher's career unless one of two things happens: 1) they lead the collaboration and get their name as first author 2) the collaboration allows some kind of recompense for the time invested into the collaboration.

For this reason, many collaborations have a proprietary period, a time when the collaboration's data is available exclusively to the people who have sunk their time into it. Imagine if the people behind the Millennium or Aquarius simulations were forced to publish their code as soon as they had run the simulations (this applies to any theorist). These people have spent the past years of their lives tweaking and perfecting code _for which they get no publications or citations_, and before they can even begin to analyze the easiest results (the "low hanging fruit", the simplistic papers that usually go to collaboration members), the simulations are being run and analyzed all over the world. They have gained nothing by their efforts in actually developing the code (our world does not work like industry; we don't get a pat on the back and a raise for being team players).

For theorists, often entire careers revolve around codes that have been developed over the researcher's whole career-- to be forced to hand it over to all the first year PhDs in the world is a tremendous slap in the face.

For me, as an "experimentalist" (read, data-analyst), I also maintain job attractiveness by having sets of code that no one else has. I'm experienced in a variety of pattern finding algorithms, and have even invented a few myself for very specific problems-- and within my little sphere, people know this about me, I'm the person people come to for certain things. I spent years of my life perfecting these, learning about algorithms, learning the intricacies and quirks of the various datasets we use-- if someone wants to take my place in this community as "that guy" then I expect them to devote as much time as I have to learning these techniques inside out and then to do something better than me. I have no desire to pass my code off to a masters student and let him naively plug some dataset into it-- firstly because no one should ever rely on a black box in this field, and secondly because these codes all need to be tweaked to account for the different instruments and data structures being used.

Perhaps if my contract didn't have a built-in expiration I would be willing to care-bear-share my fractal search methods and spend my Thursday afternoons showing you how Savitzky-Golay smoothing works, but as long as I'm out of a job next July, and you're the guy applying for my job, you can keep your filthy hands off my code.
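For readers unfamiliar with the smoothing technique named above: it fits a low-order polynomial to each sliding window of the data, which reduces noise while preserving peak shapes better than a plain moving average. A minimal sketch (no relation to the commenter's code) using the classic published 5-point, quadratic Savitzky-Golay convolution weights:

```python
def savgol5(y):
    """Savitzky-Golay smoothing: 5-point window, quadratic fit.

    The weights (-3, 12, 17, 12, -3)/35 are the classic tabulated
    coefficients for this window size and polynomial degree; the two
    points at each end are left unsmoothed for simplicity.
    """
    w = (-3, 12, 17, 12, -3)
    out = list(y)
    for i in range(2, len(y) - 2):
        out[i] = sum(wk * y[i + k - 2] for k, wk in enumerate(w)) / 35
    return out

# Because each window is fit with a degree-2 polynomial, the filter
# reproduces any quadratic signal exactly at interior points.
signal = [x * x for x in range(10)]
smoothed = savgol5(signal)
```

In practice one would reach for a library implementation (e.g. scipy's `savgol_filter`) rather than hardcoding weights, but the tabulated-coefficient form shows why the method is just a convolution.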

[+] ErikCorry|13 years ago|reply
Wow. This reply has only served to illustrate how broken the current system is.

If you are publishing articles with no source code, how is that even science? Why not just skip the articles and publish unsubstantiated assertions? After all, if you publish details of how you worked stuff out, even without source code someone else might reproduce your work.

If the new rule was that you had to publish source, then there would still be the same number of jobs in astronomy. I don't see how _everyone's_ careers could be negatively impacted. And don't you think that if you were using the software from another group, you would cite the authors?

[+] Blahah|13 years ago|reply
"no one should ever rely on a black box in this field"

This is deeply hypocritical. When someone publishes results generated by code but doesn't publish the code, they are asking all readers to rely on the output of a black box.

[+] krichman|13 years ago|reply
This is disappointing. I'll say that I wish the situation would change so that you could get funding without the dog-eat-dog competition, allowing you to share source code and ideas. Think how much more the next version of you would be able to do if he had access to the ideas you invented! This isn't to admonish you; obviously it's the situation that's causing this.

It's utterly bizarre that there would be any field of science that actively deters collaboration, because without collaboration it seems incredibly difficult to achieve great things.

[+] scott_s|13 years ago|reply
I'm confused by one of your scenarios:

> Imagine, if the people behind the Millennium or Aquarius simulations were forced to publish their code as soon as they had run the simulations (this is true for any theorist)-- now these people have spent the past years of their lives, tweaking and perfecting code _for which they get no publications or citations_ and before they can even begin to analyze the easiest results ("low hanging fruit", the simplistic papers that usually go to collaboration members), the simulations are being run and analyzed all over the world. They have gained nothing by their efforts in actually developing the code (our world does not work like industry, we don't get a pat on the back and a raise for being team players).

My confusion: who said they had to release their code when it was finished? My understanding is that they would only release their code when they first publish. So I don't see a problem here: when they finish their code, but before they can analyze their first results, they have no external pressures.

However, once they publish those preliminary results? Yes, there will be external pressures, because others can start using their code to also look for interesting results. But that's the way things should progress. The original authors will still have an enormous benefit from having a deep understanding of the code, and the techniques they used in it.

I'm a computer scientist. I work on systems software research. I work in industry, so I can't release most of my code. When I was a PhD student, I always released my code. I wanted people to try to build on what I did. In fact, releasing my code is what has actually gotten me citations - people have either used my code directly in a larger project, or they have extended and improved upon it.

In systems software, academics always release their code. They want people to use it and build on it. The frequency with which other groups are able to beat out the original authors is almost zero - understanding a non-trivial code base takes time, and if the original authors are still working on the same problem, they have an enormous advantage.

Your attitude protects you, but prevents others from learning from you.

[+] tripzilch|13 years ago|reply
> For me, as an "experimentalist" (read, data-analyst), I also maintain job attractiveness by having sets of code that no one else has. I'm experienced in a variety of pattern finding algorithms, and have even invented a few myself for very specific problems-- and within my little sphere, people know this about me, I'm the person people come to for certain things.

> I spent years of my life perfecting these, learning about algorithms, learning the intricacies and quirks of the various datasets we use-- if someone wants to take my place in this community as "that guy" then I expect them to devote as much time as I have to learning these techniques inside out and then to do something better than me.

> I have no desire to pass my code off to a masters student and let him naively plug some dataset into it-- firstly because no one should ever rely on a black box in this field, and secondly because these codes all need to be tweaked to account for the different instruments and data structures being used.

> Perhaps if my contract didn't have a built-in expiration I would be willing to care-bear-share my fractal search methods and spend my Thursday afternoons showing you how Savitzky-Golay smoothing works, but as long as I'm out of a job next July, and you're the guy applying for my job, you can keep your filthy hands off my code.

First, WOW. I can't imagine any sector (except, indeed, academia) where a "keep your filthy hands off my code" attitude like that is remotely acceptable.

The second question is: whose code, exactly? In most lines of work, if you write code for an employer (a university, in this case, I suppose), the copyright for that code implicitly belongs to the employer. In this particular case, that means they could, and arguably should, force you to open this code. It could very well be that your contract explicitly states otherwise, but it has to be explicit about code; ownership of code isn't simply implied along with, for instance, written works.

Imagine if you were to work for, say, a security analysis consultancy, and you wrote a lot of cool machine learning and analysis code to detect intrusions or leakage. Regardless of whether your contract expires, if you leave, you're not leaving with your code. And if you refused to document it properly so that the next guy could use it, expect the contract to end prematurely.

I can imagine that seems frustrating and scary NOW, but only because they made it seem like yours was a proper approach for all those years you gave it. Of course it wasn't-- and deep down inside you know this would be clear to everyone if only everything had been open from the start.

[+] davorak|13 years ago|reply
> Perhaps if my contract didn't have a built-in expiration I would be willing to care-bear-share my fractal search methods and spend my Thursday afternoons showing you how Savitzky-Golay smoothing works, but as long as I'm out of a job next July, and you're the guy applying for my job, you can keep your filthy hands off my code.

This seems to be getting closer to the root of the problem. Any insight into why the job market is set up that way?

[+] J_Thomas|13 years ago|reply
"I spent years of my life perfecting these, learning about algorithms, learning the intricacies and quirks of the various datasets we use-- if someone wants to take my place in this community as "that guy" then I expect them to devote as much time as I have to learning these techniques inside out and then to do something better than me. I have no desire to pass my code off to a masters student and let him naively plug some dataset into it-- firstly because no one should ever rely on a black box in this field, and secondly because these codes all need to be tweaked to account for the different instruments and data structures being used."

The result is that you are important, and no one else can check on you. That's kind of good for you, but there should be a way you can do better.

In my junior year in college, some psychologists told me about a statistician who would once a year write a paper demonstrating ways that psychologists misused some statistical technique. He would quote 20 or 30 psychology papers and explain why their statistics were wrong and so their conclusions were wrong. They lived in fear of him.

If there existed robust and reasonably transparent code to do what you do, along with great documentation to show users where the pitfalls are, and you got a valued publication every time you showed that a significant result was done wrong and gave the better version, you would very likely be better off. I'm pretty sure astronomy would be better off, too.

I don't know how we could get from here to there, but it's something to consider.

[+] nilx|13 years ago|reply
I can't help putting this comment in the context of "Top Ten Reasons to Not Share Your Code" by Randall J Leveque (http://faculty.washington.edu/rjl/icerm2012/icerm_reproducib...).

"The gist of the article is to urge readers to reconsider current attitudes about sharing code related to publications by pondering an “alternative universe” in which mathematics papers are not expected to contain the proofs of theorems. Many of the objections I hear repeatedly to sharing code can be applied to such a universe.[...]

4. Giving the proof to my competitors would be unfair to me. It took years to prove this theorem, and the same idea can be used to prove other theorems. I should be able to publish at least 5 more papers before sharing the proof. If I share it now my competitors can use the ideas in it without having to do any work, and perhaps without even giving me credit since they won’t have to reveal their proof technique in their papers."

[+] ylem|13 years ago|reply
I'm sorry, but this attitude is just wrong. In order to trust results, scientific source needs to be verifiable. I'm a physicist, and I've seen people not share code for a few reasons:

1) The code smells bad -- I've had code that's useful but that I feel is really ugly; even that I've released when people have asked, because it's better than having it just sitting around.

2) To gain papers -- for example, I once needed maxent for something and a guy wouldn't release his source; he wanted to run that portion of the problem and be a co-author, so I just wrote my own.

3) Competitive advantage -- like the OP says, he wants to keep an edge by having code that other people don't. Personally, I prefer to keep an edge by doing better science.

I think if you're writing software that others use, you still get some indirect credit from it, but things progress better when the code is available and everyone isn't wasting time reinventing the wheel -- it also helps with error-checking.
[+] kd0amg|13 years ago|reply
Contrast with the CPU simulator SimpleScalar, whose source code is freely available to academic users. Google Scholar lists hundreds of citations for the original tech report (you wouldn't do work based on someone else's simulation code without citing them, would you?), so it seems safe to say the work has done quite a bit for the author's professional status.
[+] qwerta|13 years ago|reply
I had a similar dilemma in my second semester at university. I changed my career to business and write astronomical software as a hobby. I now probably write more astronomical software than if I had stayed in the field. It's more relaxing, and I can focus on long-term projects.
[+] stargazer-3|13 years ago|reply
As a first-year master's student in astronomy, I fully agree with you.
[+] crntaylor|13 years ago|reply
Have you also seen the Open Exoplanet Catalogue proposal by Hanno Rein? (http://arxiv.org/abs/1211.7121)

I have no idea if you know him or not, but if not, you should consider getting in touch!

[+] teuben|13 years ago|reply
Isn't that exactly what the virtual observatory already has? They base their tables on XML, and call it a VOTable. Lots of fabulous software can read and write those.
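For readers unfamiliar with the format mentioned above: a VOTable is an XML document in which FIELD elements declare the columns and TABLEDATA holds the rows. A minimal stdlib-only parsing sketch (real VOTables declare an IVOA namespace and much richer metadata, omitted here for brevity; full parsers exist in libraries such as astropy):

```python
import xml.etree.ElementTree as ET

# A stripped-down VOTable-style document (namespace and most metadata
# omitted; the coordinate values below are made up for illustration).
doc = """
<VOTABLE>
  <RESOURCE>
    <TABLE>
      <FIELD name="ra" datatype="double" unit="deg"/>
      <FIELD name="dec" datatype="double" unit="deg"/>
      <DATA>
        <TABLEDATA>
          <TR><TD>10.68</TD><TD>41.27</TD></TR>
          <TR><TD>83.82</TD><TD>-5.39</TD></TR>
        </TABLEDATA>
      </DATA>
    </TABLE>
  </RESOURCE>
</VOTABLE>
"""

root = ET.fromstring(doc)

# Column names come from the FIELD declarations...
fields = [f.get("name") for f in root.iter("FIELD")]

# ...and each TR row zips its TD cells against those names.
rows = [
    dict(zip(fields, (float(td.text) for td in tr)))
    for tr in root.iter("TR")
]
```

The self-describing FIELD/TABLEDATA split is what lets generic tools read each other's tables without prior agreement on columns.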