Maybe I'm just being jaded, and I'm certainly not a researcher or statistician, but I don't see how removing "statistical significance" from scientific nomenclature is going to prevent lazy readers (or science reporters) from trying to distill a "yes/no" or "proven/unproven" answer from P values listed in a complex research paper.
Well, that's why the article doesn't propose simply ditching P-values; it proposes reporting confidence intervals instead. Not only do they provide more information (by simultaneously conveying both statistical and practical significance), they're also easier to interpret correctly without special training.
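To make that concrete, here is a minimal Python sketch with invented numbers (scipy's Welch t-test and a simple normal-approximation interval are just convenient choices for illustration). Both comparisons come out "statistically significant", but only the intervals show that one effect is practically trivial and the other is large:

```python
# Minimal sketch (invented numbers): both comparisons are "statistically
# significant", but only the interval shows which effect actually matters.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def report(a, b, label):
    diff = a.mean() - b.mean()
    se = np.sqrt(a.var(ddof=1)/len(a) + b.var(ddof=1)/len(b))
    _, p = stats.ttest_ind(a, b, equal_var=False)   # Welch t-test
    print(f"{label}: diff={diff:+.2f}, 95% CI=({diff-1.96*se:+.2f}, {diff+1.96*se:+.2f}), p={p:.2g}")

report(rng.normal(0.05, 1, 50_000), rng.normal(0, 1, 50_000), "trivial effect, huge n")
report(rng.normal(1.00, 1, 30),     rng.normal(0, 1, 30),     "large effect, small n")
```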
If you eliminate the p-value, then you can't have authors who search for anything with p < 0.05 and then publish; there will simply have to be some other justification. If the p-value is gone, it will have to be replaced with something, and that something, the supposition is, will result in better science.
Writing a paper, you need to support your conclusion; removing the p-value doesn't remove that need for support, it just means the support will take a different form, hopefully a better one.
Protecting against misinterpretations by outsiders is not something that scientific research papers should worry about.
If anything, getting rid of a term because we currently find it insufficient to convey what we mean by it will, in all likelihood, open the race for far more confusing, rosy and well-meaning, yet more meaningless nomenclature.
By that measure, I find this entire idea rather idiotic.
Note that there are already journals that effectively do this (it is very hard to get a p-value put into the journal Epidemiology for example) and as far as I can tell, there are few if any negative repercussions evident.
Although my failure to see an elephant on the table does not rule out completely that there could be an elephant there, it does limit the possible size of the elephant to a few micrometers. Failure to reject the null hypothesis does in fact provide evidence against the other possibilities, so long as "other possibilities" are understood to mean "other possibilities with big effects."
I don't see why a scientist at a conference who's saying that two groups are the same has to be heard as claiming, "we have measured every electron in their bodies and found that they have the same mass, forget about six sigma, we did it to infinity." Instead they could simply be understood to be saying that the two groups must be similar enough to not have ruled out the null hypothesis in their study.
That's the thing. P values don't prove that anything must be. They simply say that if you reran the experiment, it would be surprising to get a different result. Conversely, if you don't find "statistical significance" it definitely doesn't mean there isn't a difference. In practice, it might (often) mean the study didn't have enough samples to find a relatively small effect, but the layperson making decisions (do I allow right turn on red or is that dangerous?) may not get that nuance. A book that really helped clarify my thinking on this is _Statistics Done Wrong_ by Alex Reinhart.
Edit: remove "interpret" from last sentence to clarify
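A quick simulation shows how often that happens. Everything here is assumed for illustration, not taken from the article: a real effect of 0.2 standard deviations studied with 50 subjects per group is missed by most studies of that size:

```python
# Sketch (assumed numbers): a real but modest effect, studied with too few
# subjects, fails to reach p < 0.05 most of the time.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
true_effect, n_per_group, n_sims = 0.2, 50, 2_000

hits = 0
for _ in range(n_sims):
    control = rng.normal(0.0, 1.0, n_per_group)
    treated = rng.normal(true_effect, 1.0, n_per_group)
    _, p = stats.ttest_ind(treated, control)
    hits += (p < 0.05)

print(f"Studies reaching significance: {hits / n_sims:.0%}")   # typically ~15-20%
```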
Any observation is consistent with an infinite number of models. E.g., your sight is defective; i.e., in many cases: your sample is biased, not big enough, etc.
And the fact that A correlates with B, or fails to, at "some significance" is consistent with any causal relationship between A and B.
> Surveys of hundreds of articles have found that statistically non-significant results are interpreted as indicating ‘no difference’ or ‘no effect’ in around half (see ‘Wrong interpretations’ and Supplementary Information).
This thread is a prime example of the danger of p-values, which measure how likely the observed data would be if you assume the null hypothesis. This is very different from the probability that the hypothesis is true.
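A back-of-the-envelope calculation makes the gap vivid. The base rate and power below are assumptions chosen for illustration, not anything from the article:

```python
# Sketch (assumed numbers): P(significant | no effect) = 5% does NOT mean
# P(no effect | significant) = 5%.
base_rate = 0.10   # assumed: 10% of tested hypotheses describe a real effect
power     = 0.50   # assumed: chance a real effect reaches significance
alpha     = 0.05

true_hits  = base_rate * power            # real effects that test significant
false_hits = (1 - base_rate) * alpha      # true nulls that test significant anyway
print(f"Fraction of significant results that are false: "
      f"{false_hits / (true_hits + false_hits):.0%}")   # ~47% with these assumptions
```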
Statistical significance doesn't communicate expected bounds on possible effect size though. If the two sigma bounds are hundreds of meters, the failure to see an elephant on the table is completely meaningless. On the other hand, if it's a few micrometers, that tells you a lot.
Yes, failure to rule out an elephant means you probably didn't look hard enough (collect enough data).
But successfully ruling out an elephant is uninteresting if you didn't expect an elephant. The problem is that "statistically significant" sounds impressive, but we shouldn't be impressed.
I guess we need a less impressive term for this? Maybe something like "may have avoided statistical blindness."
>They could simply be understood to be saying that the two groups must be similar enough to not have ruled out the null hypothesis in their study
The groups need not be similar to fail to rule out the null. It can also be that the measurements are too noisy and too few.
Also on the flip side, if you do reject the null, it doesn't mean the groups are different. It could also be that you have so many measurements that you are picking up tiny biases in your experiment or instruments.
Null hypothesis testing is almost always too weak of a test to be useful.
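The second failure mode mentioned above (so many measurements that tiny biases reach significance) is easy to simulate. The 0.01 offset here is an invented, practically irrelevant instrument bias, yet with a million measurements per group the null is rejected emphatically:

```python
# Sketch: a practically irrelevant bias becomes overwhelmingly "significant"
# once the sample is large enough.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
a = rng.normal(0.00, 1.0, 1_000_000)
b = rng.normal(0.01, 1.0, 1_000_000)    # assumed tiny systematic offset

_, p = stats.ttest_ind(a, b)
print(f"observed difference ~ {abs(a.mean() - b.mean()):.3f} SD, p = {p:.1e}")
```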
Sorta, but people have taken that reasoning and (by this analogy) begun looking for elephants with their eyes closed. In that case, the failure to see an elephant doesn't actually tell you much of anything.
I predict nothing will change. Flaws in p-values and confidence intervals have been apparent since almost their inception. Jaynes spoke out against them strongly from the 60's on (see, for example, his 1976 paper "Confidence Intervals vs Bayesian Intervals"). Although I can't find it right now, there was a similar statement about p-values from a medical research association in the late 90's. It's not just a problem of misunderstanding the exact meaning of p-values either. There are deep-rooted problems, like optional stopping, which undermine them further.
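Optional stopping in particular is simple to demonstrate. In this sketch (the batch size and number of looks are arbitrary choices), we test after every batch and stop at the first p < 0.05 even though there is no effect at all; the false-positive rate comes out several times the nominal 5%:

```python
# Sketch: "peek after every batch and stop at the first p < 0.05" under a
# true null. Optional stopping inflates the false-positive rate well past 5%.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n_sims, batch, max_looks = 2_000, 10, 50
false_positives = 0

for _ in range(n_sims):
    a, b = [], []
    for _ in range(max_looks):
        a.extend(rng.normal(0, 1, batch))
        b.extend(rng.normal(0, 1, batch))   # same distribution: no effect exists
        if stats.ttest_ind(a, b).pvalue < 0.05:
            false_positives += 1
            break

print(f"False-positive rate with peeking: {false_positives / n_sims:.0%}")
```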
The problem is that with all its problems, statistical significance provides one major advantage over more meaningful methods: it provides pre-canned tests and a number (.05, .01, etc) that you need to 'beat'. The pre-canned-ness/standardization provides benchmarks for publication.
I once worked in a computational genomics lab. We got a paper into PNAS by running a Fisher exact test on a huge (N=100,000+) dataset, ranking the p-values, taking the lowest ones, and reporting those as findings. There's so much wrong with that procedure it's not even funny.
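For anyone who hasn't seen why that's a problem, here is a toy version of the multiple-comparisons issue (scaled down to 10,000 tests to keep it fast, with data that contain no association whatsoever). This only illustrates ranking p-values across many tests; it is not a reconstruction of that paper's analysis:

```python
# Sketch: run many Fisher exact tests on pure noise, rank the p-values, and
# admire the "top hits". Nothing real is in the data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
n_tests, n_per_group, rate = 10_000, 100, 0.3
p_values = []

for _ in range(n_tests):
    a = rng.binomial(n_per_group, rate)   # successes in group 1
    b = rng.binomial(n_per_group, rate)   # successes in group 2, same true rate
    _, p = stats.fisher_exact([[a, n_per_group - a], [b, n_per_group - b]])
    p_values.append(p)

p_values.sort()
print("five smallest p-values:", [f"{p:.1e}" for p in p_values[:5]])
print("roughly how many fall below 0.001 from noise alone:", int(n_tests * 0.001))
```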
Hippocratic medicine lasted well into the 19th century, centuries after the scientific revolution. There'd been critics correctly calling it an intellectual fraud before then. You could've taken this as proof that no force on Earth could drag medicine into modernity, but it did sort of happen, as it became public, common knowledge that doctors were harming more people than they helped. They did start cleaning up their act (literally) though it took a long time and I think they're still collectively irrational about chronic conditions.
I hope we aren't worse at reform than they were in the 1800s.
As I recall, instead of "compatibility intervals" (or confidence intervals), other gainsayers of P tests have proposed simply making the existing P criterion more selective, like a threshold value of .01 rather than .05, which equates to increasing the sample size from a minimum of about 10 per cohort to 20 or more.
I suspect this will be the eventual revision that's adopted in most domains, since some sort of binary test will still be demanded by researchers. Nobody wants to get mired in a long debate about possible confounding variables and statistical power in every paper they publish. As scientists they want to focus on the design of the experiment and results, not the methodological subtleties of experimental assessment.
I know, right? God forbid that we take a close look at the ways we might be fooling ourselves. Sounds like hard work.
Raising the threshold will not just reduce the probability of a false positive result; it will also raise the probability of a false negative. The social sciences deal with complex phenomena, and it may be that there is no simple hypothesis like A -> B that describes reality with p < 0.05. In reality A may cause B, but there are also C, D, ..., Z, and some of them also cause B, while others work the other way and cancel some of the others out. And some of them work only when the Moon is in the right phase.
p < 0.01 is good when we have a good model of reality that generally works. When we have no good model, there is no good value for p. The trouble is that all the hypotheses are lies; they are false. We need more data to find good hypotheses. And we think, "there are useful data and there are useless data; we need to collect the useful while rejecting the useless." But we do not know which data are useful while we have no theory.
There is an example from physics I like: static electricity. Researchers described in their works what causes static electricity, and there was a lot of empirical data. But all that data was useless, because the most important part of it didn't get recorded. The most important part was the temporality of the phenomenon: a statically charged object stayed charged for some time and then discharged. Why? Because no material is a perfect insulator, there was a process of electrical discharge; there was voltage and current. That was the link to all the other known electrical phenomena. But physicists missed it because they had no theory; they didn't know what was important and what was not. They chased what was shiny, like the sparks from static electricity, not the lack of sparks after some time.
We are modern people. We are clever. We are using statistics to decide what is important and what is not. Maybe it is a key, but we need to remember that it is not a perfect key.
The example from the article about the two drug studies seems to indicate that would not be useful.
> For example, consider a series of analyses of unintended effects of anti-inflammatory drugs2. Because their results were statistically non-significant, one set of researchers concluded that exposure to the drugs was “not associated” with new-onset atrial fibrillation (the most common disturbance to heart rhythm) and that the results stood in contrast to those from an earlier study with a statistically significant outcome.
> Now, let’s look at the actual data. The researchers describing their statistically non-significant results found a risk ratio of 1.2 (that is, a 20% greater risk in exposed patients relative to unexposed ones). They also found a 95% confidence interval that spanned everything from a trifling risk decrease of 3% to a considerable risk increase of 48% (P = 0.091; our calculation). The researchers from the earlier, statistically significant, study found the exact same risk ratio of 1.2. That study was simply more precise, with an interval spanning from 9% to 33% greater risk (P = 0.0003; our calculation).
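The quoted numbers can be checked directly. Under the usual normal approximation on the log risk-ratio scale (an assumption of this sketch, not necessarily the exact method behind the article's calculation), the reported intervals imply p-values close to the quoted ones:

```python
# Sketch: back out an approximate two-sided p-value from a reported risk
# ratio and its 95% CI, using a normal approximation on the log scale.
import math

def p_from_rr_ci(rr, lo, hi):
    se = (math.log(hi) - math.log(lo)) / (2 * 1.96)   # CI half-width on log scale
    z = math.log(rr) / se
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

print(p_from_rr_ci(1.2, 0.97, 1.48))   # ~0.09, close to the quoted P = 0.091
print(p_from_rr_ci(1.2, 1.09, 1.33))   # ~0.0003, matching the earlier study
```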
The OP spends some time on a point that the threshold is fairly arbitrary, and the problem is misinterpreting what it actually _means_ for validity and other conclusions.
I suspect just changing the threshold (especially as a new universal threshold, rather than related to the nature of the experiment) wouldn't even strike the authors as an improvement.
> Third, like the 0.05 threshold from which it came, the default 95% used to compute intervals is itself an arbitrary convention. It is based on the false idea that there is a 95% chance that the computed interval itself contains the true value, coupled with the vague feeling that this is a basis for a confident decision.
Part of the problem is that p-values are not the best indicator in all applications. One question in my work is whether a process change affects the yield. A confidence interval of (-1%,+1%) is much different than (-20%,+20%), even though they would look the same if I was just interested in the p-value. We might also accept changes with a (-1%,+10%) confidence interval. We can't 'prove' that yields would increase, but there is significantly more upside than downside.
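For a yield question like that, the interval can be computed directly and carries the decision-relevant information. A sketch with made-up pass/fail counts, using the two-proportion normal approximation; both changes below are "non-significant", but only the second leaves room for a double-digit swing:

```python
# Sketch (made-up counts): the same "not significant" verdict can hide very
# different amounts of upside and downside, which the interval exposes.
import math

def yield_change_ci(pass_old, n_old, pass_new, n_new, z=1.96):
    p_old, p_new = pass_old / n_old, pass_new / n_new
    diff = p_new - p_old
    se = math.sqrt(p_old*(1 - p_old)/n_old + p_new*(1 - p_new)/n_new)
    return diff - z*se, diff + z*se

print(yield_change_ci(9_000, 10_000, 9_020, 10_000))   # about (-0.006, +0.010), i.e. (-0.6%, +1.0%)
print(yield_change_ci(90, 100, 92, 100))               # about (-0.06, +0.10), i.e. (-6%, +10%)
```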
I agree significance is misused, but in the opposite way from the one these authors describe. They are concerned that authors claim "non-significant" means "no effect"; I see a lot of authors claiming "significant" means "causal effect." They don't account for the consequences of running multiple tests, or for endogeneity.
Differences between the means of any two groups (e.g. treatment and control) on any outcome will tend to be non-zero. Interpreting this sample difference as a population difference without considering the confidence interval seems risky.
I gave a relevant talk a few years ago: How to Kill Your Grandmother with Statistics[1].
The authors are spot on that the problem is not p-values per se but dichotomous thinking. People want a magic truth box that doesn’t exist. Unfortunately there are a ton of people in the world who continue to make money off of pretending that it does.
[1]: https://www.youtube.com/watch?v=iRpAHS5_hDk
What happens when your model errors aren't normally distributed?
If the kurtosis is high, p-values are over-stated. If fat-tailed then p-values are understated.
Why? Because the sampling distribution of your test statistic isn't guaranteed to be normal.
Normal is a nice assumption, but asymptotics can take a long time to kick in. The CLT is beautiful analytically, but fortunes are made off people who assume it.
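One way to see how long "asymptotic" can take: simulate the actual size of a nominal 5% one-sample t-test under a few error distributions. The sample size and distributions below are arbitrary choices for illustration; the printed rejection rates show how far the real error rate can drift from 5% when normality fails:

```python
# Sketch: actual false-positive rate of a nominal 5% t-test when the errors
# are fat-tailed or skewed rather than normal (assumed distributions, n=15).
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
n, n_sims = 15, 20_000

def size_of_test(sampler, label):
    rejections = sum(stats.ttest_1samp(sampler(n), popmean=0.0).pvalue < 0.05
                     for _ in range(n_sims))
    print(f"{label}: actual size = {rejections / n_sims:.3f} (nominal 0.050)")

size_of_test(lambda k: rng.normal(0.0, 1.0, k),                  "normal errors     ")
size_of_test(lambda k: rng.standard_t(df=2, size=k),             "fat tails (t, 2df)")
size_of_test(lambda k: rng.lognormal(0.0, 1.0, k) - np.exp(0.5), "skewed (lognormal)")
```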
Lack of statistical literacy is a huge problem these days. As the modern workforce trends towards more analytical methods, statistics can be used as a weapon to fool and bend the truth.
I'm frankly tired of seeing executives go on stage trying to show some numbers and graphs to prove a point about some variables. You see this in board meetings too. The sample sizes are too small to conclude anything significant about it!
> When was the last time you heard a seminar speaker claim there was ‘no difference’ between two groups because the difference was ‘statistically non-significant’?
I'm an Economics PhD (and former professor), and if someone were to say those lines at an academic conference there is a high likelihood that they would literally be laughed at.
Maybe it is because of my background in a quantitative field where we place a huge emphasis on statistical rigor, but t-tests were pretty much dismissed by anyone serious 20+ years ago. The issue seems to stem from disciplines without a stats/math background just pointing to t-stats. My wife reads medical literature for her work, and I gag pretty much every time she asks me to look at the results.
Yeah - I'm an epidemiologist, and the medical literature is painful. "Statistical curmudgeon" is, as far as I can tell, my primary role as peer reviewer.
The fundamental problem here seems to be that you cannot get around the need for a statistician (or someone from another field who has a similarly deep understanding of statistics) to look at the data. There is no shortcut for this, but we simply do not, as a society, have enough people with statistical knowledge sufficient to the task.
There is not, I suspect, any other solution but that we must train a whole lot more statisticians. This means we will need to give more credit, and authority, and probably pay, to people who choose to pursue this field of study.
Yet Nature forces you to provide p-values and n counts for everything you can in any figure, as if that's enough to guarantee the significance of the results.
We need to start publishing with transparent and reproducible code from raw data to figure. Show me the data and let me make my own conclusions.
It's not too hard. I'm writing my PhD thesis, and every figure is produced from scratch and placed in the final document by a compilation script. My Jupyter notebooks are then compiled to PDF and attached to the thesis document as well. Isn't this a better way of doing the "methods" section?
was written over 45 years ago. Granger is rolling over in his grave every time someone "discovers" a magical relationship between two time-series. In all honesty, statistics is hard and it's something you need to practice on a regular basis.
Statistical significance is required, but not sufficient to prove an effect. Lack of statistical significance means you did not prove an effect, but you also didn't prove there is no effect.
So the answer is more likely "statistical significance and more" rather than "ditch statistical significance".
When we're talking about how to take data as implying X, what is needed is: [logical reason to believe position, how the data was chosen to not bias the whole process, etc] + [data above threshold].
The data that a scientist gets "lives" inside one or another experimental box, some area. But unless the scientist also takes into account how that box and that data came to be, the scientist cannot make any definitive statement based on the properties of just the data.
The statement "Correlation does not [automatically] imply causation" and "Extraordinary claims require extraordinary evidence" both reflect this.
I have always said that if applying an experimental medication resulted in a useful effect observed in 1 subject out of 1000, this doesn't mean it's garbage that should be dismissed at that point. It can perfectly well mean that the one person was different in the same way 1 out of every other 1000 people is, and 0.1% of the Earth's population is 7.55 million people still worth curing.
- The basis of a p-value is very much aligned with the scientific process in that you aren't trying to prove something 'is true'; rather, you're trying to prove something false. Rejection of p-values / hypothesis testing is a bit like rejecting the scientific method. I am lucky enough to be friends with one of the physicists who worked on finding the Higgs Boson, and he hammered it into my head that their work was to go out of their way to prove the Higgs Boson was a fluke - a statistical anomaly - sheer randomness. This is a very different mentality to trying to prove your new wunder-drug is effective - especially when those pesky confidence intervals get in the way of a promotion or a new grant. It's much easier to say p-values are at fault.
- Underpinning p-values is a distributional assumption, and it needs to match that of whatever process you're trying to test, or else the p-values become less meaningful.
- The 5% threshold is far too lax. Even when everything else is done right, one true-null result in twenty will clear it through nothing but dumb luck, and more will if the distributional assumptions aren't met. Why are we choosing a 5% threshold for a process/drug that can have serious side-effects?
- p-value hacking. So many sneaky ways to find significance here. Taleb goes into some detail into the problem of p-values here https://www.youtube.com/watch?v=8qrfSh07rT0 and in a similar vein here https://www.youtube.com/watch?v=D6CxfBMUf1o.
Doing stats well is hard and open to wilful and naive abuse. The solution is not to misuse or throw away these tools but to understand them properly. If you're in research you should think of stats as being part of your education, not just a tickbox used to validate whatever experiment you're doing.
It definitely needs to be left out of anything with non-statisticians in the intended audience. I've started leaving it out of most reports. If I write about a difference, it's statistically significant. The test just gives me confidence to write it.
Or are you just summarising?
As someone who does a lot of meta-analyses I'd prefer you left in non-significant values as well, if they bear on the hypotheses at hand. Aggregating over nonsignificant effect sizes can still result in an overall effect that is significant.
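That is easy to see with a toy inverse-variance pooling. The study estimates below are invented: each one is non-significant on its own, yet the fixed-effect pooled estimate clearly excludes zero:

```python
# Sketch (invented studies): five individually non-significant estimates,
# pooled by inverse-variance weighting, give a clearly significant result.
import math

studies = [(0.20, 0.12), (0.15, 0.11), (0.25, 0.14), (0.10, 0.13), (0.22, 0.12)]  # (estimate, SE)

weights = [1 / se**2 for _, se in studies]
pooled = sum(w * est for (est, _), w in zip(studies, weights)) / sum(weights)
pooled_se = math.sqrt(1 / sum(weights))
z = pooled / pooled_se
p = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

for est, se in studies:
    print(f"study:  {est:+.2f} +/- {1.96*se:.2f}   (not significant on its own)")
print(f"pooled: {pooled:+.2f} +/- {1.96*pooled_se:.2f}   p = {p:.4f}")
```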
From Brad Efron in [1]: "The frequentist aims for universally acceptable conclusions, ones that will stand up to adversarial scrutiny. The FDA for example doesn't care about Pfizer's prior opinion of how well its new drug will work, it wants objective proof. Pfizer, on the other hand may care very much about its own opinions in planning future drug development."
Significance requirements should be approached differently depending on the use-case. The above are two extreme cases: the FDA authorizing a new drug, where significance guarantees should be rigorously obtained beforehand, and at the other extreme, exploratory data analysis inside a private company, where data scientists may use fancy priors or unproven techniques to fish for potential discoveries in the data.
Now how much significance guarantee should be required from a lab scientist is unclear to me. Why not let lab scientists publish their lab notebooks with all experiments/remarks/conjectures without any significance requirement? The current situation looks pretty much like this anyway, with many papers making significance claims that are not reproducible.
We should ask how much the requirement of statistical significance hinders the exploratory process of science. Maybe the current situation is fine, maybe we should have new journals for "lab notebooks" with no significance requirements, etc.
On the other hand, in the mathematical literature, wrong claims are published often; see [2] for some examples. But mathematicians do not seem to be as critical of this as the public is of non-reproducible papers in the life sciences. Wrong mathematical proofs can be fixed, and wrong proofs that can't be fixed sometimes still contain a fruitful argument that could be helpful elsewhere. More importantly, the most difficult task is to come up with what to prove; if the proof is wrong or lacks an argument, the claim can still be pretty useful.
[1]: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.179...
[2]: https://mathoverflow.net/questions/35468/widely-accepted-mat...
As a layman who probably didn't understand that whole article, I ask:
If "statistical significance" is just sort of an empty phrase used to dismiss or prove something somewhat arbitrarily, then isn't the same person writing the same study likely to be just as arbitrary in declaring what is or isn't significant anyway?
I feel like this is one area where clickbait media has pushed things backwards. Everyone wants the clicks, so facts from studies get skewed into binary results when it's almost always shades of gray. If I see a study showing that you may be slightly less likely to get Alzheimer's if you drink green tea every day, but only on the order of half a percent or so, I don't have a magic cure-all for Alzheimer's. But you will see news headlines: "Green tea cures Alzheimer's! And may even be effective for ED!" Maybe we shouldn't rise up against statistical significance, and should instead push back on incorrect dissemination of the results?
It's interesting that they talk about a category error around "no association". In fact there is a category error in applying statistical thinking in cases where objects are not comparable - like human metabolisms, ecosystems, art...