
The Irreproducibility Crisis of Modern Science

270 points | vixen99 | 8 years ago | nas.org

251 comments

[+] lisper|8 years ago|reply
The site is down so I can't read the original report, but I've read reports on this topic in the past so I'm going to chime in with some "usual suspects" caveats:

1. No result is 100% reproducible because you can never completely reproduce the conditions of any experiment. The best you can hope to do is to reproduce the conditions that matter, but enumerating those has to be part of the theory you are testing, and so you can never be 100% sure that you have a complete list.

2. Even a completely non-reproducible result can be scientifically significant. For example, celestial events are almost never reproducible. Our understanding of celestial mechanics nonetheless rests on solid science.

3. The end-product of science is not truth, it is explanations of observations. Those observations can (indeed must) include non-reproducible ones. Sometimes the explanation of non-reproducible results is "experimental error" or "delusion" or "we just don't know." But non-reproducible events are nonetheless within the purview of science.

On the other hand...

4. The statistical tests currently in widespread use as a criterion for publication in peer-reviewed journals guarantee that at least one result in 20 will be due to chance and not because the hypothesis being tested is actually true. That, combined with the suppression of negative results, guarantees that the results published in journals adhering to those standards will be unreliable. But this doesn't really have anything to do with reproducibility per se; it has to do with the fact that the journals use a weak criterion for defining positive results. Add our human predilection to value positive results over negative ones and the understandable desire of scientists to advance their careers, and journals are all but guaranteed to contain many defensible but false results. That is not a failure of reproducibility; it is a consequence of poor policy choices.
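
To make the base-rate point concrete, here is a minimal simulation sketch (my own illustrative numbers, not from the report): when every hypothesis tested is actually null, a p < 0.05 cutoff still lets roughly 5% of results through, and if only the "positives" get published, the published record for those hypotheses is entirely noise.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    n_experiments, n_per_group = 10_000, 30
    false_positives = 0
    for _ in range(n_experiments):
        # Two groups drawn from the same distribution: the null hypothesis is true.
        a = rng.normal(0.0, 1.0, n_per_group)
        b = rng.normal(0.0, 1.0, n_per_group)
        _, p = stats.ttest_ind(a, b)
        if p < 0.05:
            false_positives += 1
    print(f"significant by chance alone: {false_positives / n_experiments:.1%}")  # about 5%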

[+] toufka|8 years ago|reply
I'd add a last point: a scientific paper (or a particular experimental result) is not sufficient for 'science'. Citing a paper as either a truth or a discovery is disingenuous. The endeavour of science gets at a kind of consensus by having lots of papers, lots of experiments, lots of additional layers of experiments that build on each other. And yes, it's annoying to realize a given paper is misleading, has errors, or is even wrong - but that annoyance and those workarounds are part of the scientific process. It's actually fairly rare for an experiment or paper to be strong enough to stand completely on its own as a singular, thorough, and completed discovery. If you've been part of a scientific field, you know that it takes years, even after a good publication, for the knowledge to accrete to the greater scientific body.

Some of the best advice I got was to spend a lot more time than one might want coming up with an experimental design that avoided a merely statistical outcome and instead probed an either/or mechanism. It's not always possible, but when it is, it can be a powerful alternative to fighting statistical battles (see the Michelson-Morley experiment for a classic example).

Furthermore, careers, families, reputations, and humans are also part of that scientific process, which, together with the above, requires and admits some deviation from perfect efficiency - something we as a society should accept as okay.

[+] godelski|8 years ago|reply
> 1. No result is 100% reproducible because you can never completely reproduce the conditions of any experiment.

This is why you include the error in your results. We don't care whether two experiments produce 100% identical results. We care whether the results agree within the error of the experiment.

> 2. Even a completely non-reproducible result can be scientifically significant. For example, celestial events are almost never reproducible. Our understanding of celestial mechanics nonetheless rests on solid science.

This is why we share data. That is in essence our observation. A celestial event might be caught by only one instrument, but several people can develop several different models. We wait and watch for similar events though, to check if models are consistent (within error).

> 3. The end-product of science is not truth, it is explanations of observations.

I just wanted to repeat this because it can never be stated enough.

> 4. The statistical tests currently in widespread use as a criterion for publication in peer-reviewed journals guarantee that at least one result in 20 will be due to chance and not because the hypothesis being tested is actually true.

While I don't like p-values (especially 0.05), because of p-hacking, that's not how statistics works. Flipping 10 heads doesn't guarantee 5 tails next, or even 1. I wouldn't use a word as strong as "guarantee".

But this is also an argument FOR reproducing. If multiple experiments are consistent with one another (within error), then that strengthens the argument.

TLDR: More brains looking at a problem help solve the problem.

I will add my own statements about the reproducibility problem in science. One cause is that there is less funding for it: reproducing an experiment isn't sexy. Another problem stems from the fact that data isn't always open. It is hard to review work if you don't know everything about the experiment. Data can even contain simple mistakes that just weren't caught. But it can also be embarrassing to share data.

[+] 3JPLW|8 years ago|reply
> The statistical tests currently in widespread use as a criterion for publication in peer-reviewed journals guarantee that at least one result in 20 will be due to chance and not because the hypothesis being tested is actually true.

That's wildly over-pessimistic. That would only be the case if scientists just went around looking at the world and came up with null hypotheses willy-nilly. That's generally not the case. There is usually a _reason_ for conducting the test and a plausible mechanism of action. There are two possible explanations for a significant result:

* The null hypothesis is correct but they got "unlucky" data (5% chance or p% chance)

* There is a real effect (and the null hypothesis is actually wrong)

This becomes more problematic in reproducibility tests, though, since that context biases my prior towards a correct null hypothesis, and now you must be very careful about pre-registration and the number of folks worldwide who are trying to reproduce a given experiment.
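
A rough sketch of the arithmetic behind this (the priors, power, and alpha below are made-up illustrative values, not from the thread): how believable a significant result is depends heavily on how plausible the hypothesis was before the experiment.

    def prob_real_given_significant(prior_real, power=0.8, alpha=0.05):
        """P(effect is real | result is significant), by Bayes' rule."""
        true_pos = prior_real * power          # real effect, and the test detects it
        false_pos = (1 - prior_real) * alpha   # no effect, but the data got "unlucky"
        return true_pos / (true_pos + false_pos)

    for prior in (0.5, 0.1, 0.01):
        print(f"prior {prior:5.2f} -> posterior {prob_real_given_significant(prior):.2f}")
    # A well-motivated hypothesis (prior 0.5) gives ~0.94; a long shot (prior 0.01) only ~0.14.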

[+] yellowstuff|8 years ago|reply
> The statistical tests currently in widespread use as a criterion for publication in peer-reviewed journals guarantee that at least one result in 20 will be due to chance and not because the hypothesis being tested is actually true.

I have a pedantic correction, but a relevant one. If a journal makes a rule that results must have p < .05, and scientists then go off and do a bunch of well-managed science, they will get results with a range of p-values; .05 will be a ceiling, not a floor, on the published p-values. More papers have p-values right around .05 than they would if results were being produced fairly, so that's actually evidence of bad practices.
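
A small simulation sketch of what that looks like (assumed setup: two-sample t-tests, half of the studies testing a real effect; all numbers illustrative): honestly produced significant p-values pile up near zero, not just under .05.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    published = []
    for _ in range(20_000):
        effect = 0.5 if rng.random() < 0.5 else 0.0   # half the studies chase a real effect
        a = rng.normal(0.0, 1.0, 40)
        b = rng.normal(effect, 1.0, 40)
        _, p = stats.ttest_ind(a, b)
        if p < 0.05:                                  # the journal's publication rule
            published.append(p)

    counts, edges = np.histogram(published, bins=np.arange(0.0, 0.051, 0.01))
    for lo, hi, c in zip(edges[:-1], edges[1:], counts):
        print(f"{lo:.2f}-{hi:.2f}: {c}")
    # Far more of the published p-values land near 0 than just under .05; a bump
    # right below .05 in a real literature suggests p-hacking.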

[+] lordnacho|8 years ago|reply
> 3. The end-product of science is not truth, it is explanations of observations. Those observations can (indeed must) include non-reproducible ones. Sometimes the explanation of non-reproducible results is "experimental error" or "delusion" or "we just don't know." But non-reproducible events are nonetheless within the purview of science.

Yes and no. I think you have to have predictions in there somewhere. That's why your celestial mechanics is science: whatever the movement is, you were able to say what it was going to be beforehand, even if it only happened once.

As for explanation, that is not necessarily so simple, either. Sure, a lot of things seem to make more sense with a theory, but not every theory makes sense. A lot of stuff in fundamental physics is predicted by the equations, but in what sense are the equations explanations? "Shut up and calculate" comes to mind.

[+] buvanshak|8 years ago|reply
>For example, celestial events are almost never reproducible. Our understanding of celestial mechanics nonetheless rests on solid science.

I am not sure this is correct. The rules that govern celestial bodies are the same as the ones that govern objects on Earth. So why are they not reproducible? You don't need to measure forces between celestial bodies to measure the value of G; measuring the forces between two masses on Earth is enough.

[+] Waterluvian|8 years ago|reply
We need a journal that exclusively publishes papers with results that were totally contrary to the hypothesis.

So long as the researchers aren't outright inept or fraudulent, what's there to be ashamed about? Those would be some of the more interesting papers to read, in my opinion.

[+] kerkeslager|8 years ago|reply
> 1. No result is 100% reproducible because you can never completely reproduce the conditions of any experiment.

It's worth noting that this is part of the goal of reproducing results. If two identical experiments produce the same result, you're only providing proof for the very narrow theory the experiment tests. Slightly different experiments allow you to demonstrate that the theory is applicable. Imagine if evolution only occurred in the Galapagos, or if gravity only worked in an apple orchard in Cambridge. The only reason we know that evolution and gravity aren't local oddities is that they have both been tested in a wide variety of locations, with a wide variety of experiments.

[+] Erlich_Bachman|8 years ago|reply
4.)

You mean that papers are published with p < 0.05?

When did this become a "standard"? Most papers I have seen in respectable journals and other sources would not settle for such a high p-value; they would aim much lower, like 0.001.

[+] khr|8 years ago|reply
Power is also a concern that many researchers do not pay attention to (at least in psychology/neuroscience).

If the original study was under-powered, the estimated effect size in that study will be inflated and any replication attempt that uses this inflated effect size estimate will be severely underpowered.

Plus, two independently conducted studies that are each powered at 80% to detect a true effect will both come out positive only 64% of the time (assuming absolutely nothing fishy is going on, e.g. p-hacking).
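
Two quick sketches of those claims with made-up numbers (not from any actual study): the 64% figure is just 0.8 x 0.8, and the effect-size inflation is the "winner's curse": among underpowered studies, only the overestimates reach significance.

    import numpy as np
    from scipy import stats

    print("both studies significant:", 0.8 * 0.8)     # 0.64

    rng = np.random.default_rng(2)
    true_effect, n = 0.3, 20                           # small effect, small sample -> low power
    significant_estimates = []
    for _ in range(20_000):
        a = rng.normal(0.0, 1.0, n)
        b = rng.normal(true_effect, 1.0, n)
        _, p = stats.ttest_ind(a, b)
        if p < 0.05:
            significant_estimates.append(b.mean() - a.mean())

    print("true effect:", true_effect)
    print("mean estimate among significant results:",
          round(float(np.mean(significant_estimates)), 2))   # well above 0.3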

[+] lurquer|8 years ago|reply
> 3. The end-product of science is not truth, it is explanations of observations.

The term 'science' is used in this way. Perhaps it should not be. Perhaps semantics drive the issue.

In the field with which I am familiar, we have 'expert witnesses' who will use reproducible scientific principles to analyze, for instance, blood spatter to determine the angle at which a shotgun was held when it discharged and blew off a particular person's head. The principles - based on physics, chemistry, etc. - are sound and scientific.

The problem is, we will invariably have TWO very scientific expert witnesses who BOTH use reproducible and accepted methods. Yet, using their scientific tools, they will come to two utterly different conclusions.

It's left to a jury to decide which interpretation is correct.

While the tools used by the experts are scientific, their opinions are not. While their scientific tools are reproducible (for instance, the speed at which coagulating blood and brain tissue will slide down a stucco wall given all the variables such as temperature, humidity, friction coefficients, etc.) their ultimate reconstruction of what happened is not reproducible, and it is left to lay people to draw their conclusions.

There should be more of a distinction between Science and History. That is, science -- strictly defined -- is all about reproducibility. History -- by definition -- is not. Using scientific tools to reconstruct a past event does not result in a 'scientific' theory. Rather, it results in a historical theory arrived at with scientific tools.

Whether it is blood spatter from a murder (or suicide, depending on whom you ask), the formation of the Grand Canyon, or the development of finch beaks on the Galapagos, the ultimate theories are inherently non-scientific as they cannot be tested. The methods and tools used to derive the theories can be... but not the conclusions themselves.

In short, perhaps the term 'science' is being used for things outside its domain. (Or, alternatively, if we wish to include such things inside the domain, we should broaden the strict definition of science. Like I said, there are some issues of semantics that may be driving some of the issues in the article... other issues, of course, pertain to sloppiness, errors, and the like.)

[+] eli_gottlieb|8 years ago|reply
>The end-product of science is not truth, it is explanations of observations.

That sounds a whole lot like truth to me, or at least, like any sane construal of truth.

[+] Thrymr|8 years ago|reply
To be clear, the "NAS" (nas.org) that published this study is the National Association of Scholars [0], a political group, not the National Academy of Sciences (nasonline.org) [1], a nongovernmental organization that consists of scientists elected by their peers to provide independent scientific advice to the US government. There was in fact a study published recently in PNAS, the Proceedings of the National Academy of Sciences, on this topic [2].

[0] https://en.wikipedia.org/wiki/National_Association_of_Schola...

[1] https://en.wikipedia.org/wiki/National_Academy_of_Sciences

[2] http://www.pnas.org/content/early/2018/03/08/1802324115

[+] mistrial9|8 years ago|reply
Some of the comments miss an important distinction. "Science" in the popular press often refers to science versus completely non-scientific ways of forming an opinion or deciding policy. Meanwhile, science within a technical community is subject to human error and manipulation, and relies on reproducible results, as well as peer review, to settle conflicting claims.

There are certainly non-scientific ways of forming an opinion and deciding policy, in many cases totally legitimate. But once science is claimed, then of course it has to be subject to scientific rigor.

[+] m-watson|8 years ago|reply
Yes, thank you for saying this. I feel like people often treat science as some ethereal being. It is a process that has to be used; if that process isn't happening, it isn't science. That doesn't mean everything has to be science, but things that are science should be held to a certain standard and should also be given extra weight if that standard is met.
[+] WhompingWindows|8 years ago|reply
Never mind the massive debate that occurs within the scientific community itself on some new or controversial claims. When the press gets its hands on the views of one dissenter and one proponent, it makes it seem like there are only sides A and B. This is not as simplistic as politics, where politicians form into immutable groups Left vs Right. We are dealing with numerous camps, and within each scientific camp, there are numerous arguments made for/against an issue.

Take climate change, no doubt a controversial issue. To say it is "controversial" in the political sense would mean Left and Right (in the USA alone) have vociferously different stances on the issue. However, if viewed as a scientific controversy, we are now talking about detailed methodological concerns, like methods of data collection, analysis, kinds of statistical bias, and subtle changes in arbitrary parameters. Any scientist can tweak this or that in their model to make it conform more easily to their preconceived notions about climate change. Unfortunately, we also have some oil industry shills out there who got trotted out as an equally weighted side B to the side A of the dozens of scientists who would generally disagree. Then, you also have anti-science proponents who use the legitimate self-criticism of scientists to attack science as a whole.

It's a sad state of affairs. Science should be reported in the press, but it should also be reported much better than it is. In the USA in particular, STEM education is lagging behind: the average person can't tell good science coverage from bad, and we have ridiculous notions and conspiracy theories that never get filtered out (anti-vax, climate change denial, flat earth, a faked moon landing, etc.).

For me, reproducibility is one problem in a broader ecosystem of scientific problems, including science education generally, as well as misuse of statistics and a saddening drive for incremental results at the expense of the more broad-based thinking that might lead to fundamental breakthroughs. Our education systems must be reformed to deal with these problems; that's the only way out that I see.

[+] timtadh|8 years ago|reply
In response to several threads here: it is important to distinguish when scientists are self-critical from when non-scientists are critical of the scientific method. For instance, there is a long history of scientists criticizing how the scientific process is currently conducted for the purpose of improving the scientific endeavor. That work is sometimes used by non-scientists who question the overall scientific method. However, such use is invalid, as the scientific self-criticism

1. assumes the validity of the scientific method

2. relies on the scientific method as its critical lens

Whereas those who critique science as a whole:

1. assume that the scientific method does not work and does not arrive at "truth"

2. then use scientists being self critical to prove #1.

Such a "proof" does not work, as it uses the assumption "the scientific method arrives at truth" to derive the contradiction "the scientific method does not arrive at truth". See for instance this comment: https://news.ycombinator.com/item?id=16859200

In reality, work on reproducibility is about improving the practice of science overall. It does not in itself show that science is inherently untrustworthy. What it does show is that scientific discovery is difficult, it takes a lot of effort, and new findings should be treated critically. What does "critically" mean in this context? It means analyzing, within the boundaries of science, the theoretical basis, hypothesis, method, and experimental results for potential flaws. It does not mean being skeptical by default because science "doesn't work."

[+] 323454|8 years ago|reply
I agree with your overall point, but technically speaking it is logically valid to prove a hypothesis false by first assuming it and then deriving a contradiction, even when the contradiction is the negation of the original hypothesis (as it is in your example).

What you should have said is that some critics start with the premise "the scientific method does not arrive at truth", and then use other people's arguments that depend on the premise "the scientific method arrives at truth" to support their claim, which is indeed logically invalid.

[+] buvanshak|8 years ago|reply
>scientists criticizing how the scientific process is currently conducted for the purposes of improving the scientific endeavor.

I think what is happening here is a bit more serious. They are showing a widespread crisis; it is not just some minor feedback to improve the process.

>It does not in itself show that science is inherently untrustworthy.

I think when statistics is involved, the results are inherently untrustworthy. This is not really surprising, because there are a whole bunch of ways a study that involves statistics can go wrong. And we are still finding new ways in which this can go wrong.

Then there are things like publication bias, which take this to a whole new level. They mean that a biased body of journals can project any consensus it favors just by selecting studies that fit its narrative. The inherent issues with statistics mean that you can find studies showing any possible outcome.

[+] danharaj|8 years ago|reply
Let's take a page from Marx. Science is many things, in particular a relationship between capital and labor. The scientific method is a wonderful idea, but it is subordinate to the economic forces that underlie scientific activity. Look at the conflicts and contradictions between those doing science (labor) and those deciding what science is to be done (capital); that is the ultimate source of these crises.

The executive summary lists 40 points on how to improve the reproducibility of science. A bit over half of them are addressed to the sources of capital, such as private organizations, universities, and governments. I think many of those points are good. However, I don't think the other points, the ones that recommend doing science in different ways, pack much of a punch. Even if you fix the problems that exist today, so long as science is a rat race of trying to get grant money to stay afloat while burning out grad student after grad student, I think other pathological practices will creep in as a completely rational response on the part of scientists to a hostile ecosystem. There's just a very big gap between how science should be done and what capital owners want from science.

[+] JumpCrisscross|8 years ago|reply
> There's just a very big gap between how science should be done and what capital owners want from science

It was "a team from Bayer Healthcare" who "tried to replicate the results of basic cancer studies," failed, "and kicked off a media storm questioning the legitimacy of cancer science—and science in general" [1]. The "capital owners" looking out for their own buck are performing more effectively, narrowly speaking, than academia.

[1] https://www.wired.com/2017/01/fighting-cancers-crisis-confid...

[+] TangoTrotFox|8 years ago|reply
I think this is pretty self-evident, but the issue is this: say you have some system where economic forces are removed, essentially a basic income for researchers, in one scenario. This would suddenly and massively incentivize people in that direction, since it's basically a career path that guarantees a stable livelihood, something that's extremely rare today. Well, you need to ensure that nobody is just completely gaming the system, so inevitably you'll end up with some sort of qualifier for results. And now suddenly you've done nothing but kick the can, since this new qualifier is what's going to be gamed.

Maybe the biggest problem is what you mention, but in another direction. For whatever reason there seems to be extremely little interest in the direct private funding of science. In times past the aristocracy would often fund scientific research on all sorts of topics. Today the practice seems to have all but disappeared, certainly relative to how common it once was.

[+] stevedonovan|8 years ago|reply
While we're talking political economy, I was thinking that academia tends towards a degenerate form resembling subsistence agriculture: defending a small patch of turf and making just enough meaning to support a career. Even if the raw economic incentive is removed (e.g "publish or perish") there is sufficient ego investment to preserve this pattern.
[+] jonmc12|8 years ago|reply
You might enjoy the book, The Fellowship: Gilbert, Bacon, Harvey, Wren, Newton, and the Story of a Scientific Revolution.

The book is set around 1660, as the center of the scientific universe shifted from Italy to England. Interestingly, the formation of the Royal Society occurred in the midst of the English Civil War. Not only economic forces, but military and political forces as well, had a dramatic impact on shaping what became the formalization of the scientific method and the peer-review process.

Consider that the very existence of Oxford (and other universities) hung on the arbitrary conceptions of military generals and the political implications of their decisions.

[+] quantumofmalice|8 years ago|reply
The reproducibility crisis is most severe in the social sciences. Hard sciences like physics are on much firmer ground, and conflating the two is clownish. Social science research is funded almost entirely by the government[1].

It's a nice story you've got there, but the reason there is a replication crisis in the softer sciences is not due to the evil capitalists of the marxist imagination. In fact, much of it is due to cultural marxist insistence on outcomes conforming to political correctness, as well as uncritical acceptance of politically conforming work. See the ongoing fight against the idea of a high genetic basis for adult IQ.

All of this is obvious enough: capitalists aren't interested in non-reproducible results unless those results absolve them of blame for a given externality. But that sort of research is a small fraction of overall research (and should absolutely be done by impartial third parties).

[1] https://en.wikipedia.org/wiki/Funding_of_science

[+] spikels|8 years ago|reply
Marx!? Marx can teach us very little about economics and almost nothing about science. Science is NOT “a relationship between capital and labor”.

If you see everything through such a strong lens, you see very little.

[+] makecheck|8 years ago|reply
There has to be similar prestige/career-building/notoriety/funding for spending time on reproducing the experiments of others. Without that shift there will clearly be a greater tendency to just try something new.

Also, when experiments depend on source code, etc. we need real engineering tools/principles applied. (Something like: “you can’t publish paper X if you aren’t including a public repository with build/run instructions”.) Unfortunately, there are all kinds of reasons why scripts/builds could fail just a few months or years later so they would have to be checked too.

I think it would be cool if document-generation caught on in the publication of papers, i.e. the paper itself is generated by running actual experiment scripts and producing charts, etc. from plain text source.
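
As a hypothetical sketch of what that could look like (the file names and template here are invented for illustration): the build script runs the analysis, writes the figure, and fills the numbers into the text, so re-running it regenerates the paper from source.

    import numpy as np
    import matplotlib
    matplotlib.use("Agg")                      # headless backend, suitable for a build server
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(0)
    data = rng.normal(loc=1.2, scale=0.4, size=200)    # stand-in for the real measurements

    mean = data.mean()
    sem = data.std(ddof=1) / np.sqrt(len(data))

    plt.hist(data, bins=30)
    plt.xlabel("measurement")
    plt.savefig("figure1.png")                 # the figure cited by the paper

    template = "We observe a mean of {mean} +/- {sem} (s.e.m.) across 200 runs (Figure 1).\n"
    with open("paper.txt", "w") as f:          # the "paper" text, regenerated on every build
        f.write(template.format(mean=f"{mean:.2f}", sem=f"{sem:.2f}"))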

[+] Retric|8 years ago|reply
Being reproducible is only critically important if people treat individual studies as meaningful.

That IMO is a far more dangerous stance. Any study can have hidden flaws, none should be trusted without some form of replication.

[+] quotemstr|8 years ago|reply
Are you claiming that reproducibility is unimportant when multiple studies cover the same area? Do you think so because you imagine errors in the studies would be uncorrelated? Not the case: look at the social sciences. Errors are very much correlated.
[+] rplst8|8 years ago|reply
How often is a study even repeated in the course of normal scientific research? I think most studies are focused on expanding the research, and therefore the knowledge, about the science being studied. Accepting previous studies as fact is dangerous and could easily lead to a house-of-cards scenario. I think this is especially true of the very specific, niche studies that are common in today's highly competitive, publish-or-perish graduate research landscape.
[+] leereeves|8 years ago|reply
The issues mentioned in the introduction affect entire fields, not just individual studies:

> Improper use of statistics, arbitrary research techniques, lack of accountability, political groupthink, and a scientific culture biased toward producing positive results

[+] SubiculumCode|8 years ago|reply
We know that weak classifiers can be combined to produce a strong classifier, à la AdaBoost.

Each study is a weak classifier and would have a 'reproducibility crisis' if retested on new data. However, after lots of studies of similar phenomena, a strong classifier emerges. In the field, we call this converging lines of evidence.
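
A hedged illustration of the analogy with scikit-learn on synthetic data (all parameters arbitrary): a single decision stump is only modestly better than chance, but boosting a couple hundred of them gives a much stronger classifier, much as many individually noisy studies can converge on a reliable answer.

    from sklearn.datasets import make_classification
    from sklearn.ensemble import AdaBoostClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=2000, n_features=20, n_informative=10, random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    stump = DecisionTreeClassifier(max_depth=1).fit(X_tr, y_tr)      # one weak "study"
    ensemble = AdaBoostClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)

    print("single weak classifier:", round(stump.score(X_te, y_te), 2))
    print("boosted ensemble:      ", round(ensemble.score(X_te, y_te), 2))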

[+] arafa|8 years ago|reply
What I'd like to know is whether the "irreproducibility crisis" is really some combination of "the sample size was too small" and "the effect size was too small". When I went through a lot of these studies myself and looked at the ones that don't reproduce, I saw this theme over and over. "P-hacking" is less of a concern to me when the effect is real and widespread.

It's so bad now that for any article/study, I look at the sample size first. If it's too small (especially < 100) or they don't say, I just ignore it. And if they don't publish or give some estimate of the effect size, I just think about it directionally but don't give it much weight mentally.
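
For a rough sense of why tiny samples are a red flag, here is the standard back-of-the-envelope calculation (illustrative assumptions: two-sided alpha = .05, 80% power, two equal groups): the smallest effect a study can reliably detect shrinks only with the square root of the sample size.

    import math

    z_alpha, z_power = 1.96, 0.84        # two-sided alpha = .05, 80% power
    for n in (20, 50, 100, 400):
        mde = (z_alpha + z_power) * math.sqrt(2 / n)   # minimum detectable effect, in SD units
        print(f"n per group = {n:>3}: detectable effect ~ {mde:.2f} SD")
    # Below ~100 per group, only medium-to-large effects are reliably detectable.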

[+] air7|8 years ago|reply
A good solution would be simply to lower the p-value threshold.

The current "standard" threshold in many fields, 5%, is arguably too high. Consider that throwing a double six in backgammon has a probability of about 3%. That means that, on some level, throwing 6-6 would count as valid scientific "proof" of ESP (p-value < 0.05). And this is even before p-hacking.

A high p-value threshold basically externalizes part of the research cost onto other scientists and in the process creates a lot of false-positive "noise".
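
The arithmetic behind the comparison, and what a stricter threshold buys, with purely illustrative numbers:

    double_six = 1 / 36
    print(f"P(double six) = {double_six:.3f}")   # ~0.028, comfortably under the 0.05 bar

    null_hypotheses_tested = 1000                # imagine 1000 tested hypotheses with no real effect
    for alpha in (0.05, 0.01, 0.005, 0.001):
        print(f"alpha = {alpha}: ~{alpha * null_hypotheses_tested:.0f} chance 'discoveries'")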

[+] terminado|8 years ago|reply
I mean, this is still a good thing, since it charts a map of decidedly grey areas, where information may always be ambiguous, and useful information needs to be sussed out carefully.

It's better than presumptively assuming that "Science" is infallible and always black-and-white.

[+] rdlecler1|8 years ago|reply
Maybe papers need Yelp reviews on reproducibility. One star: couldn't reproduce.
[+] mathgenius|8 years ago|reply
I have met several "refugees" from particle physics who left in part because of flimsy statistical methods. So let me ask: who is going to reproduce the experiment that found the Higgs boson?
[+] gringoDan|8 years ago|reply
One of Slate Star Codex's top all-time articles discusses this very issue. Highly recommend: http://slatestarcodex.com/2014/04/28/the-control-group-is-ou...
[+] jnordwick|8 years ago|reply
> On the meta-level, you’re studying some phenomenon and you get some positive findings. That doesn’t tell you much until you take some other researchers who are studying a phenomenon you know doesn’t exist – but which they themselves believe in – and see how many of them get positive findings. That number tells you how many studies will discover positive results whether the phenomenon is real or not. Unless studies of the real phenomenon do significantly better than studies of the placebo phenomenon, you haven’t found anything.

This is such an astute observation: replication studies can be used to establish a placebo control group for bad science. Parapsychology was one such group, but now we can find many others. Brilliant.

[+] blueprint|8 years ago|reply
"In order to be a true scientist, you should be familiar with philosophy, first"
[+] flamedoge|8 years ago|reply
There is the problem of conflating science with mathematics.
[+] ajarmst|8 years ago|reply
Judging by the sheer amount of XKCD content in this report, Randall Munroe should get a co-author credit.