top | item 17822849


fwdpropaganda | 7 years ago

Having been in academia, I can tell you from my experience that there are two main reasons why scientists are often reluctant to share their code:

A) embarrassment at having other people see how bad it is. Not necessarily wrong, but bad in other ways.

B) wanting to keep an advantage by continuing to publish on top of the work already done, whereas anyone else wanting to get in on the same idea would first have to rebuild the original work.


3pt14159|7 years ago

Scientists have to have the truth of their malpractice rubbed hard against their face before they admit that they have a problem. Since Aristotle, science has been fraught with what essentially amounts to bullshit, and it only ever changes once so many people call bullshit so frequently that a new standard develops.

Not pre-publishing a research thesis / methodology is bullshit.

Not publishing data is bullshit.

Not publishing exact code (and runtime, etc) is bullshit.

Not publishing when research is complete is bullshit[0].

Not publishing is bullshit. I don't care who your funder is; it should be illegal for funders to selectively publish research. It's obviously one-sided propaganda. The only time not publishing makes sense is when there are national or global security ramifications to the research, or if funding needs to be lined up for patents.

[0] I've seen a PhD candidate delayed 4 years from publishing because her supervisor was trying to coordinate a team to publish research along the same vein all at once, to make it a groundbreaking set of discoveries. It's basically p-hacking by another name.

hermitdev|7 years ago

Not academia, but in finance: I remember an email interaction that went very nearly company-wide at a large hedge fund. In both our production & development environments, all core files generated on our Linux servers were configured to be dumped to a shared NFS mount (ostensibly to ease debugging). Each environment had around a 1 or 2 TB mount. Teams were expected to clean up their cores when they were no longer needed (useful for debugging those heisenbugs). Nearly all of the code was C++. Think tens to hundreds of millions of lines of C++ that had to work together, spread across hundreds of services and countless libs used to interconnect them.

Anyways, we were working on a fairly large company-wide effort. We were migrating from 32-bit to 64-bit, with 2 OS upgrades (a new version of RHEL, and migrating from Windows XP to Windows 7) and 2 compiler upgrades (I forget the gcc versions - they came with the upgrade; on Windows, VS2003 to VS2008). Because of the coupling of the libs & services, both the Linux & Windows sides had to be released at the same time. We could update the Windows boxes at any time and run the old software on them, but we basically couldn't develop the old software on Win7, as VS2003 wasn't compatible (it could be made to work, but the IDE or compiler or linker might fail randomly and hard). This is just to explain the scope of the effort we were making.

Back to why I mentioned the core files. To anyone that's done it, the magnitude of effort in doing this all at once is obvious. There will be bugs. Specifically, a metric-shit-ton of them. Those core files were needed. They stopped being generated because our shared core space filled up. The culprit? A PhD quant running some model in Python using some C/C++ extensions kept crashing Python, each crash dropping a multi-GB core file. This one quant was single-handedly using over half of the shared space. When confronted/inadvertently shamed in a development-wide email (we sent out usage breakdowns per user/project when a usage threshold was exceeded), his response was golden: "What are these core files, and how do I stop them from being generated?" Uniform response from everyone else: "Fix your damn code!" Mind, at the time, this was in front of ~600 developers.
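As an aside, for anyone wondering how the quant's question would actually be answered: on Linux the per-process core-file size limit can be dropped to zero, e.g. via Python's standard resource module (a minimal sketch; of course the uniform answer in the thread, fixing the crashing code, is the real remedy):

```python
import resource

# Inspect the current per-process core-file size limit (soft, hard), in bytes.
soft, hard = resource.getrlimit(resource.RLIMIT_CORE)
print("core limit:", soft, hard)

# Drop the limit to zero: a crashing C/C++ extension still brings the whole
# Python process down, but no core file is written to disk.
resource.setrlimit(resource.RLIMIT_CORE, (0, 0))
assert resource.getrlimit(resource.RLIMIT_CORE) == (0, 0)
```

This is the same knob as the shell's `ulimit -c 0`, applied programmatically to the current process and inherited by its children.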

The kicker: the reason this whole effort was being made? Because someone sent the CEO an XLSX spreadsheet in 2008 (the exact year may have been later, but immaterial to the story), and still being on WinXP & Office 2003, he couldn't open it. So...down the rabbit hole we went.

mannykannot|7 years ago

It is not rational to hold code to a lower standard than experimental technique.

As for B, that hearkens back to the pre-scientific attitudes of medieval alchemy.

ur-whale|7 years ago

B is utterly unacceptable if you're funded by public money. As a matter of fact, I don't understand why public research grants don't come with a binding obligation to make all results available, including source code.

adrianN|7 years ago

While still in academia I wrote some code to extract some data from images. It took about a week before I was happy with it. We then decided to make it available for other people. Polishing it until somebody else could use it with reasonable effort took several months and a team of students. That is time that could have been spent writing more papers. It is not at all clear whether enough people actually use the software to justify the effort.

Researchers aren't paid (and usually aren't trained) to write software that can be installed on more than one computer. I don't like that situation either, but that's how it currently is. Research code also suffers from accelerated bit rot: the dependencies are often research code themselves, and projects are abandoned when the results are published.

YorkshireSeason|7 years ago

   It took about a week before I was happy with it.
And what made you think that "I was happy with it" is enough from the point of view of the pursuit of scientific truth (e.g. how did you deal with the problem of confirmation bias)? More than half a century of experience with software engineering has taught us the hard way that it's near impossible to write bug-free code. And where it does happen (e.g. aviation software), it takes extreme dedication and resources.
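The confirmation-bias point can be made concrete: checking only the hand-picked inputs you expect to work, versus checking an invariant over many randomly generated inputs. A toy sketch (the routine and its name are hypothetical; real property-based tools such as Hypothesis take this much further):

```python
import random

def extract_mean_brightness(image):
    """Toy stand-in for a research image-analysis routine (hypothetical)."""
    flat = [p for row in image for p in row]
    return sum(flat) / len(flat)

# Confirmation-biased check: one input the author already expects to work.
assert extract_mean_brightness([[10, 20], [30, 40]]) == 25.0

# Less biased check: many random inputs plus an invariant that must hold
# regardless of input (here: the mean is bounded by the min and max pixel).
rng = random.Random(0)
for _ in range(1000):
    image = [[rng.randint(0, 255) for _ in range(8)] for _ in range(8)]
    m = extract_mean_brightness(image)
    assert min(min(r) for r in image) <= m <= max(max(r) for r in image)
```

The random-input loop cannot prove the code correct, but unlike the single expected-case assertion, it actively looks for inputs that falsify the author's assumptions.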

There is a reason that Alan Turing invented program verification in the 1940s [1].

Let me finish with a quote from computing pioneer M. Wilkes [2]: "I well remember when this realization first came on me with full force. The EDSAC was on the top floor of the building and the tape-punching and editing equipment one floor below. [...] It was on one of my journeys between the EDSAC room and the punching equipment that "hesitating at the angles of stairs" the realization came over me with full force that a good part of the remainder of my life was going to be spent in finding errors in my own programs."

[1] A. M. Turing, Checking a Large Routine. http://www.turingarchive.org/browse.php/b/8

[2] M. Wilkes, Memoirs of a Computer Pioneer, MIT Press, 1985, p. 145.

itronitron|7 years ago

I would rather the publication give a clear enough description that someone with time and resources could reproduce the software based on that description and achieve the same results. I think that would actually foster more innovation and collaboration as people implementing the software from the publication will necessarily become much more familiar with the work than they would have otherwise.

BeetleB|7 years ago

When I was in grad school, B was routine and open - not something you'd whisper but something you'd say openly.

ApostleMatthew|7 years ago

In my case, it is very much column A.

ur-whale|7 years ago

If you're embarrassed by your research code, how do you know it works, and how do you know what you're going to publish isn't built on quicksand?

fwdpropaganda|7 years ago

It's the most common one, that's why I put it first :D