
The Statistical Crisis in Science

73 points | jonathansizz | 11 years ago | americanscientist.org

35 comments

[+] yummyfajitas|11 years ago|reply
So at least two people reading this seem to think it's about using science in the context of their pet peeves. It's not.

It's about using a statistical test for a data dependent hypothesis and interpreting the test as if it were used for a data-independent hypothesis. That's all.

It's not about using statistics in politics or finance. It's about first looking at the data, then formulating a hypothesis, then running a standard test which is based on the idea that you chose the hypothesis independently of the data. This is a problem in any field.
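For what it's worth, a quick simulation makes the point (my own sketch with made-up numbers, not anything from the article): if you pick which subgroup to test after seeing the data, a test that nominally controls false positives at 5% no longer does.

    # Sketch: a data-dependent hypothesis fed into a data-independent test.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(42)
    n_studies, n_subgroups, n_per_group = 2000, 10, 30
    false_positives = 0

    for _ in range(n_studies):
        # Null world: no real difference between the groups in any subgroup.
        a = rng.normal(size=(n_subgroups, n_per_group))
        b = rng.normal(size=(n_subgroups, n_per_group))
        # "Look at the data first": test only the subgroup with the biggest gap.
        k = int(np.argmax(np.abs(a.mean(axis=1) - b.mean(axis=1))))
        false_positives += stats.ttest_ind(a[k], b[k]).pvalue < 0.05

    # A hypothesis fixed in advance would reject ~5% of the time;
    # choosing it from the data pushes the rate far higher.
    print("false positive rate:", false_positives / n_studies)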

[+] tansey|11 years ago|reply
Indeed!

I have a particularly relevant horror story. For one of my graduate classes, I built a game where two people (a liar and a truth teller) would try to convince a third person (the judge) they had experienced something. The goal was to see if people used different language when lying than they did when telling the truth. It was a fun project and we were focused more on the NLP/ML side of things. Turns out to be reasonably possible to separate our ~500 examples via a linear SVM, with some interesting separating words. That's where our claims stopped.

Then I took the results to a psychology prof. They loaded the data into SAS or something and it proceeded to perform HUNDREDS of independent t-tests. The results came out in a few seconds and the professor exclaimed "Oh look! Pronouns are statistically significant! Oh and possessive nouns too!" -- I cringed.

The flip side of this is that as statisticians, we do know how to handle different kinds of testing. If you're going to be looking at all these different outcomes, that's fine, but you just need to correct for it. We've had Bonferroni correction since the 1960s; Benjamini-Hochberg and related false discovery rate methods have been around for almost 20 years now. In fact, there are even situations where the data can help define a prior for your hypothesis testing [1].
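To make that concrete, here's a minimal sketch (mine, on simulated data, so the feature names and numbers are made up) of running a pile of t-tests and then applying Bonferroni and Benjamini-Hochberg via statsmodels:

    import numpy as np
    from scipy import stats
    from statsmodels.stats.multitest import multipletests

    rng = np.random.default_rng(0)
    n_features, n_per_group = 200, 250            # e.g. word-count features, liars vs. truth-tellers
    liars = rng.normal(size=(n_per_group, n_features))
    truth = rng.normal(size=(n_per_group, n_features))   # no real effect anywhere

    # One t-test per feature: with 200 tests at alpha=0.05, ~10 "significant"
    # results are expected by chance alone.
    pvals = stats.ttest_ind(liars, truth, axis=0).pvalue

    bonf = multipletests(pvals, alpha=0.05, method="bonferroni")[0]
    bh = multipletests(pvals, alpha=0.05, method="fdr_bh")[0]
    print("uncorrected:", (pvals < 0.05).sum(),
          "| Bonferroni:", bonf.sum(), "| Benjamini-Hochberg:", bh.sum())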

Lastly, there is this quote from the article:

> There is no statistical quality board that could enforce such larger analyses—nor would we believe such coercion to be appropriate

I'm not sure that's such a bad idea. In theory, that should be the job of the journal reviewers and editor; in practice, it's often the blind leading the blind, with confirmation bias thrown in to boot. Maybe we need a statistical review board (SRB) as a companion to the IRB.

[1] http://arxiv.org/abs/1411.6144

[+] noahl|11 years ago|reply
So actually, I think the idea that data-dependent hypotheses are bad is fundamentally wrong, and is based on a misunderstanding of probability.

The reason you'd avoid data-dependent hypotheses is simple: if the data comes from a process with some sort of randomness in it, then there will usually be things that appear to be interesting patterns but are in fact random artifacts. If you look at your data, you may be tempted to formulate a hypothesis based on these random artifacts. It may pass statistical tests (because the data you have does contain the pattern), but it is not, in fact, causal. To avoid this, you maintain the discipline of only making hypotheses before you look at your data, which means that you can't see a random effect and then guess that it's real.

The problem is, this doesn't mean that if your hypothesis passes a statistical test, the result must be causal. It only lowers the probability - there is still a chance that your hypothesis was wrong, but your data happens to contain a random fluctuation that makes it look right. The only way to protect against this danger is to continuously gather data and re-evaluate your hypotheses, while understanding that there is always some probability that the effect you think you see is really random noise.

And once you're doing this continuous monitoring anyway, then there's no reason to reject data-dependent hypotheses. By definition, if the effect you're seeing is a random occurrence, then it should go away with more data. If it doesn't go away, then maybe you've found something that you wouldn't have been able to guess in advance, which is good! And if you see a random effect, form a hypothesis that passes some test, and then assume that your hypothesis must be true, then the problem is not your data, but rather that you misunderstand how probability works.

In short, avoiding data-dependent hypotheses is a hack that only reduces the probability of an error that you should be avoiding entirely anyway. Once you accept this and start avoiding the error, there's no reason to avoid data-dependent hypotheses, and they can be quite useful.
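If it helps, here's a toy version of that argument in code (my own sketch, not from the parent): pick the most extreme-looking feature in pure noise, "confirm" it on the same data, then check it again on fresh data from the same process.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    data = rng.normal(size=(100, 50))              # 100 samples, 50 pure-noise features

    # Data-dependent hypothesis: whichever feature happens to look most extreme.
    best = int(np.argmax(np.abs(data.mean(axis=0))))
    p_same = stats.ttest_1samp(data[:, best], 0).pvalue
    print("p on the data that suggested the hypothesis:", p_same)   # often < 0.05

    # Continuous monitoring: gather more data and re-evaluate.
    fresh = rng.normal(size=(100, 50))
    p_fresh = stats.ttest_1samp(fresh[:, best], 0).pvalue
    print("p on fresh data:", p_fresh)             # a spurious effect tends to go away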

[+] lrei|11 years ago|reply
I upvoted your comment, but "a problem in any field" seems too strong a statement.

AFAIK this isn't much of a problem in CS (my field), and I've never heard math or physics people complaining...

It seems to only be a problem in the social "sciences" and bio/med, where many (most?) results are statistical significance tests.

[+] anon4|11 years ago|reply
Was about to post this. Thanks.
[+] dschiptsov|11 years ago|reply
Not only errors and misuse of statistics and misapplication of probability theory, but also abstract modeling in general.

The very idea of modeling dynamic abstract processes such as financial markets, which are themselves mere abstractions, is non-science; it is a misuse of pseudo-scientific methods and mathematics, and what we have seen so far is nothing but failures.

Overly abstract or flawed abstractions and wrong premises cannot be fixed by any amount of math or modeling; they can only be discarded.

The famous "subject/object" false dichotomy in philosophy is a good example too. People could spend ages modeling reality using non-existent abstractions.

Today all these multiverse "theories" are mere speculations about whether Shiva, Brahma or Vishnu is the most powerful, forgetting that all of these were nothing but anthropomorphic abstractions of different aspects of one reality.

The notion that so-called "modern science" is a new religion (a contest of unproven speculations) is already quite old.

btw, a good example of the reductionist mindset (instead of piling up abstractions) could be the Upanishadic reduction of all the Gods to one Brahman, for which Einstein accidentally discovered a formula, E = mc², where c is a constant (implying that there is no time in the Universe).

[+] hessenwolf|11 years ago|reply
You are throwing the baby out with the bath water, with respect to financial modelling. Yes, there are failures, and, yes, the models are severely imperfect.

We reduced the risk on a portfolio of 2 billion Euro from about a billion Euro to about 50 million Euro using hedging. The remaining 50 million was mostly basis risk, i.e., the mismatch between the underlying instruments in the liabilities and the hedge assets.

Using a similar logic to yours, senior management argued that we introduced a new risk called basis risk by trading derivatives.
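A toy illustration of the distinction (my own numbers, chosen only to land in the same ballpark as the figures above, not a model of any real portfolio): hedge an exposure with an instrument that tracks it imperfectly, and the large directional risk collapses into a much smaller residual, which is the basis risk.

    import numpy as np

    rng = np.random.default_rng(7)
    n = 100_000
    liability_move = rng.normal(0, 1, n)                    # shocks to the liabilities' underlying
    hedge_move = liability_move + rng.normal(0, 0.05, n)    # hedge underlying tracks imperfectly

    exposure, sensitivity = 2_000, 0.5                      # million EUR notional, illustrative only
    unhedged_pnl = -exposure * sensitivity * liability_move
    hedged_pnl = unhedged_pnl + exposure * sensitivity * hedge_move

    print("unhedged risk (std of P&L, mEUR):", round(unhedged_pnl.std()))   # ~1000
    print("hedged risk   (std of P&L, mEUR):", round(hedged_pnl.std()))     # ~50: the basis risk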

[+] chriswarbo|11 years ago|reply
Scientists have tried over at least the past few hundred years (depending on your definitions) to build, from scratch, a perspective on the world which is as free from human bias as possible. At the moment, the jewel in the crown is quantum physics: an inherently statistical theory, so detached from human biases and assumptions that many smart people have struggled to understand or accept it, despite its incredible predictive power.

At the heart of the whole process is statistical inference: generalising the results of experiments or observations to the Universe as a whole. A "statistical crisis in science" would be terrible news. We may have been standing on the shoulders of the misinformed, rather than giants. Our "achievements", from particle accelerators to nukes and moon rockets, could have been flukes; if the underlying statistical approach of science was flawed, the predicted behaviour and safety margins of these devices could have been way off. We may be routinely bringing the world to the edge of catastrophe, if we don't understand the consequences of our actions.

Oh wait, it seems like some "political scientists" have noticed that their results tend to be influenced by external factors. I hope they realise the irony in their choice of examples:

> As a hypothetical example, suppose a researcher is interested in how Democrats and Republicans perform differently in a short mathematics test when it is expressed in two different contexts, involving either healthcare or the military.

The article questions scientists' ability to navigate the statistical minefield of biases, probability estimates, modelling assumptions, etc. in a world of external, political factors like competitive funding and positive publication bias, and the example they choose is measuring how political factors affect people's math skills!

To me, that seems the sociological equivalent of trying to measure the thermal expansion of a ruler by reading its markings. What do you know, it's still 30cm!

[+] semi-extrinsic|11 years ago|reply
Saying that quantum mechanics is an inherently statistical theory is a blatant misrepresentation. Precisely the point that makes QM so weird is that its behaviour is not caused by statistics: in a (properly set up) double-slit experiment, a single electron simultaneously travels through both slits and causes an interference pattern.
[+] chuckcode|11 years ago|reply
"all models are wrong, but some are useful." - George Box [1]

George Box captured my general feeling about statistics early on: it is a very useful tool, but remember the limitations of the methods, the data, and the people applying them. I would like to see an emphasis on openness and transparency with data, so that others can replicate the analysis and the community can come up with ways to make best practices accessible to anyone.

[1] http://en.wikiquote.org/wiki/George_E._P._Box

[+] SaberTail|11 years ago|reply
A good (in my opinion) trend in physics in the past decade or two has been the rise of "blind" analyses[1]. Basically, the entire analysis is predetermined, before looking at the data. Once all the details are nailed down and everyone agrees with the approach, the blinds are taken off. There's no room for "p-hacking".

This has some disadvantages, though. It requires a good understanding of the experiment so that you can figure out what an analysis will actually tell you. It's difficult to do a blind analysis on a brand new apparatus, since there can always be unanticipated problems with the data. As an example, one dark matter experiment invited a reporter to their unblinding. At first, it looked like they'd detected dark matter, but then they had to throw out most of the events because they were due to unanticipated noise in one of the photomultiplier tubes[2].

[1] http://www.slac.stanford.edu/econf/C030908/papers/TUIT001.pd... is a quick review.

[2] http://www.nytimes.com/2011/04/14/science/space/14dark.html
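For anyone curious what "blind" means mechanically, one common scheme (a rough sketch of the hidden-offset style; [1] covers the real variants) is to shift the quantity of interest by a secret offset, freeze every cut and fit while blinded, and only then unblind:

    import numpy as np

    secret_offset = np.random.default_rng().uniform(-5.0, 5.0)   # known only to the "blinding officer"

    rng = np.random.default_rng(3)
    raw_data = rng.normal(2.3, 1.0, size=1000)     # pretend measurement, true value 2.3

    blinded_data = raw_data + secret_offset        # analysts only ever see this
    # ... all cuts, fits and systematic-error estimates are frozen at this point ...
    blinded_estimate = blinded_data.mean()

    # Unblinding happens only after the analysis is fixed.
    print("final estimate:", blinded_estimate - secret_offset)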

[+] amelius|11 years ago|reply
I really don't understand the meaning of this sentence (below). Perhaps somebody could explain?

> As a hypothetical example, suppose a researcher is interested in how Democrats and Republicans perform differently in a short mathematics test when it is expressed in two different contexts, involving either healthcare or the military.

[+] zeroxfe|11 years ago|reply
As yummyfajitas said, this is just an example of how the actual issue manifests, but here's what that sentence means:

Democrats and Republicans have their own biases. These biases may skew their thought processes and make them perform differently on mathematics tests that are worded differently. For example, a Republican may (unconsciously) overshoot the numbers for a question about healthcare costs, or a Democrat for a question about military expenditure. Although this shouldn't happen, since mathematics tests are quite rigorously worded, it might, and a researcher is interested in investigating further.

[+] jmmcd|11 years ago|reply
> In general, p-values are based on what would have happened under other possible data sets. As a hypothetical example, suppose a researcher is interested in how Democrats and Republicans perform differently in a short mathematics test when it is expressed in two different contexts, involving either healthcare or the military. [...] At this point a huge number of possible comparisons could be performed, all consistent with the researcher’s theory. For example, the null hypothesis could be rejected (with statistical significance) among men and not among women—explicable under the theory that men are more ideological than women.

The meaning of a p-value is expressed in terms of what would have happened with a different data set, yes, but that different data set would have arisen through a different random sampling from the population. The explanation above seems to completely misunderstand the issue.
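A tiny simulation of what those "other possible data sets" are (my own sketch): draw repeated random samples from the same null population and the p-values come out uniform, so about 5% fall below 0.05. That guarantee is about re-sampling from the population; it says nothing about a researcher choosing among many possible comparisons.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(5)
    pvals = np.array([stats.ttest_ind(rng.normal(size=30), rng.normal(size=30)).pvalue
                      for _ in range(5000)])
    print("fraction below 0.05:", (pvals < 0.05).mean())   # ~0.05 under the null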

[+] CurtMonash|11 years ago|reply
Between the failings in statistics and those in modeling, there's a whole lot of science that's on shaky ground.