item 16916145

Why I've lost faith in p values

354 points | anacleto | 8 years ago | lucklab.ucdavis.edu

173 comments

[+] davidxc|8 years ago|reply
Here's a simpler thought experiment that gets across why p(null | significant effect) != p(significant effect | null), and why p-values are flawed in the way the post describes.

Imagine a society where scientists are really, really bad at hypothesis generation. In fact, they're so bad that they only test null hypotheses that are true. So in this hypothetical society, the null hypothesis in every scientific experiment ever done is true. But statistically using a p value of 0.05, we'll still reject the null in 5% of experiments. And those experiments will then end up being published in scientific literature. But then this society's scientific literature now only contains false results - literally all published scientific results are false.

Of course, in real life, we hope that our scientists have better intuition for what is in fact true - that is, we hope that the "prior" probability in Bayes' theorem, p(null), is not 1.
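The thought experiment is easy to check numerically. A minimal sketch (relying only on the standard fact that, under a true null, the p-value is uniform on [0, 1]):

```python
# A simulation of the hypothetical society above: every null tested is true,
# yet an alpha of 0.05 still "discovers" effects in about 5% of experiments,
# and by construction every one of those published results is false.
import random

random.seed(0)
ALPHA = 0.05
N_EXPERIMENTS = 100_000

# Under a true null hypothesis, the p-value is uniformly distributed on [0, 1].
published = sum(1 for _ in range(N_EXPERIMENTS) if random.random() < ALPHA)

print(published / N_EXPERIMENTS)  # close to 0.05: the publication rate
# Every one of those `published` results is a false positive.
```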

[+] taneq|8 years ago|reply
> But statistically using a p value of 0.05, we'll still reject the null in 5% of experiments. And those experiments will then end up being published in scientific literature. But then this society's scientific literature now only contains false results - literally all published scientific results are false.

The problem with this picture is that it's showing publication as the end of the scientific story, and the acceptance of the finding as fact.

Publication should be the start of the story of a scientific finding. Then additional published experiments replicating the initial publication should comprise the next several chapters. A result shouldn't be accepted as anything other than partial evidence until it has been replicated multiple times by multiple different (and often competing) groups.

We need to start assigning WAY more importance, and way more credit, to replication. Instead of "publish or perish" we need "(publish | reproduce | disprove) or perish".

Edit: Maybe journals could issue "credits" for publishing replications of existing experiments, and require a researcher to "spend" a certain number of credits to publish an original paper?

[+] tprice7|8 years ago|reply
'The fundamental problem is that p values don't mean what we "need" them to mean, that is p(null | significant effect).'

From Bayes' theorem, this more useful probability is given by p * x, where x = p(null) / p(significant effect). Maybe we could just lower the accepted threshold for statistical significance by several orders of magnitude so that, for statistically significant p, p * x is still small even for careful (i.e. big) estimates of x (e.g. maybe a Fermi approximation of the total number of experiments ever performed in the field in question). This doesn't necessarily imply impractically big sample sizes, although obviously this depends on the specifics (I believe the p value for a given value of the t-statistic decays exponentially with sample size).
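As a rough numerical sketch of this p * x correction (the prior p(null) and the power figure below are invented numbers purely for illustration):

```python
# Sketch of how the threshold interacts with the prior. "power" here is the
# probability of a significant result when the null is false; both it and
# the prior are assumed values, not estimates from the article.
def posterior_null(alpha, p_null, power):
    # p(significant) by total probability over null / non-null hypotheses
    p_significant = alpha * p_null + power * (1 - p_null)
    return alpha * p_null / p_significant  # Bayes: p(null | significant)

# With a prior of 90% that the null is true, and power 0.8:
print(posterior_null(0.05, 0.9, 0.8))   # ~0.36: many "significant" results are null
print(posterior_null(1e-4, 0.9, 0.8))   # ~0.001: a far stricter threshold helps a lot
```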

[+] nonbel|8 years ago|reply
I don't follow your argument. You've got two premises:

1) You are saying that people are committing the transposing the conditional fallacy: p(H0|data) != p(data|H0):

- OK

2) You say to use Bayes theorem to get the value we want:

- OK, but actually a better formulation is

  p(H_0|data) = p(H_0)*p(data|H_0)/[p(H_0)*p(data|H_0) + p(H_1)*p(data|H_1) + ... + p(H_n)*p(data|H_n)]
You probably don't need to add up all the way to hypothesis n since the terms eventually become negligible and can be dropped from the denominator. The point is that you have to compare how likely the result would be under other hypotheses, not just H_0.
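A numerical sketch of that formula, with made-up priors and likelihoods:

```python
# The priors p(H_i) and likelihoods p(data | H_i) below are invented; the
# point is that a low p(data | H_0) only hurts H_0 if some rival hypothesis
# explains the data better.
priors = [0.5, 0.3, 0.2]          # p(H_0), p(H_1), p(H_2)  (assumed)
likelihoods = [0.01, 0.20, 0.05]  # p(data | H_i)  (assumed)

joint = [p * l for p, l in zip(priors, likelihoods)]
posterior_h0 = joint[0] / sum(joint)  # denominator sums over all hypotheses
print(posterior_h0)  # ~0.067: H_0 loses because H_1 explains the data much better
```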

3) You propose lowering the threshold for "significance"

- How does this follow from the premises? Let's say you get a very low value for p(H_0)p(data|H_0); this can still be much higher than p(H_1)p(data|H_1), etc., so H_0 is still the best choice. I.e., you can get a low p-value given H_0, but if there is no better model out there you should still keep H_0.

[+] nonbel|8 years ago|reply
Responding to this but getting rid of the intense nesting:

  “I'm not sure what you are arguing anymore.”

  It's the claim I make in 3, and then the secondary claim that making our upper bound on p(null | T, normal, iid) small for significant p-values (i.e. p(T | null, normal, iid)) could be used as a criterion for whether our threshold for statistical significance is small enough.

  “You seemed to be disagreeing with that”

  I'm not sure what I said that gave that impression. I didn't mention anything about the normal / iid assumptions initially, not because I thought we weren't making these assumptions, but because I didn't think these details were essential to my point.

Please give some example code or calculation steps for what you are talking about.
[+] btilly|8 years ago|reply
Here is a better way to think about this.

The proper role of data is to update our existing beliefs about the world. It is not to specify what our beliefs should be.

The question that we really want to answer is, "What is the probability that X is true?" What p-values do is replace that with the seemingly similar but very different, "What is the probability that I'd have the evidence I have against X by chance alone, were X true?" Bayes factors try to capture the idea of how much belief should shift.

The conclusion at the end is that replication is better than either approach. I agree. We know that there are a lot of ways to hack p-values, and Bayes factors haven't caught on because they don't match how people want to think. However, if we keep consistent research standards and replicate routinely, the replication rate gives us a sense of how much confidence we should have in a new result we hear about.

(Spoiler. A lot less confidence than most breathless science reporting would have you believe.)
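For a concrete sense of what a Bayes factor looks like, here is a toy sketch (the coin, the two point hypotheses, and the data are all invented for illustration):

```python
# How much should 60 heads in 100 flips shift belief between "the coin is
# fair" and "the coin lands heads 60% of the time"?
from math import comb

def binom_pmf(k, n, p):
    # Probability of exactly k successes in n trials with success prob p.
    return comb(n, k) * p**k * (1 - p)**(n - k)

k, n = 60, 100
bayes_factor = binom_pmf(k, n, 0.6) / binom_pmf(k, n, 0.5)
print(bayes_factor)  # ~7.5: belief shifts toward the biased coin, but not decisively
```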

[+] kolpa|8 years ago|reply
This is like functional programming, and people have a very hard time with it. Instead of passing around numbers like "95% true", we're passing around a function: "it's 2x as likely as you thought it was, please insert your own prior and update" - or even worse, "please apply this complicated curve function at whatever value you chose for your prior". It's just too hard for people to manage. Computers can do it (though it's hard for them too: very computationally intensive), and you have to really trust your computer program to be working properly (and you have to put your ego in the incinerator!) to hand over your decision-making to the computer.
[+] mycall|8 years ago|reply
> The proper role of data is to update our existing beliefs about the world. It is not to specify what our beliefs should be.

Create the schema beforehand - I get that. But feature extraction does work, producing models from data. It sometimes takes a long time to analyze and understand those models.

[+] analog31|8 years ago|reply
My fear is that in 10 years' time, we will have learned to hack Bayes factors.
[+] sykh|8 years ago|reply
My favorite probability theory problem is related to this article.

You have a test for a disease that is 99% accurate. This means that 99% of the time the test gives a correct result. You test positive for the disease and it is known that 1% of the population has the disease. What is the probability you have the disease?

The answer is not at all the one most people think at first when given this problem. This problem is why getting two tests is always a good thing to do when testing positive for a disease.

EDIT: I updated the statement of the problem to be one that can be answered!

[+] joshuamorton|8 years ago|reply
>The answer is not at all the one most people think at first when given this problem.

The answer depends on the disease!

If it's the common cold, it's probably close to 99%. If it's Huntington's disease, the likelihood is much lower. (When asked, this question is normally posed as "you are given a test, which is 99% accurate, for some rare and deadly disease" - the "rare" part is important.)

[+] cdancette|8 years ago|reply
I'm not sure you phrased the problem correctly. If we follow your explanation, then the probability of having the disease is indeed 99%.

If you want to show the implication of Bayes' theorem then you need to be more precise: say you have a 1% false positive rate and a 1% false negative rate (99% reliability), and 1% of the population is sick. If you test positive, then the probability of being sick is much less than 99%.

[+] smartician|8 years ago|reply
Let's see... Let's say you test 10,000 people, so about 100 actually have the disease. Since the test is only 99% accurate, only 99 of those will test positive. Of the remaining 9,900 actually negative people, 99 will test falsely positive. So if you test positive, you have a 50% chance of actually having the disease?
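The counting argument above can be checked by running the same numbers through Bayes' rule directly (assuming, as stated upthread, 99% sensitivity and specificity and 1% prevalence):

```python
# Bayes' rule on the disease-test problem from the parent comments.
sens, spec, prevalence = 0.99, 0.99, 0.01

# Total probability of testing positive: true positives plus false positives.
p_positive = sens * prevalence + (1 - spec) * (1 - prevalence)
p_disease_given_positive = sens * prevalence / p_positive
print(p_disease_given_positive)  # 0.5: a positive result is a coin flip here
```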
[+] xelxebar|8 years ago|reply
> This problem is why getting two tests is always a good thing...

It's important to note that the test results should ideally be as uncorrelated as possible. At worst, a test always gives the same result as its first outcome, in which case further testing would give zero information.

In practice this means that you probably want tests that are based on completely different mechanisms.
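A sketch of the sequential update, assuming the 99%-accurate test and 1% prevalence from upthread and, crucially, that the two tests err independently:

```python
# Posterior update after each positive result; independence between tests
# is an assumption, which is exactly the parent comment's point.
def update(prior, sens=0.99, spec=0.99):
    p_positive = sens * prior + (1 - spec) * (1 - prior)
    return sens * prior / p_positive  # posterior after one positive result

after_one = update(0.01)       # ~0.5 after the first positive
after_two = update(after_one)  # ~0.99 after an independent second positive
print(after_one, after_two)
# A perfectly correlated retest would just repeat the first result and add nothing.
```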

[+] rossdavidh|8 years ago|reply
The core issue is that p-values are cheaper to get than replicating the study, but replicating the study is the only reliable way to see if a result is true or not. Sometimes the expensive/time-consuming way is the only good way.
[+] lisper|8 years ago|reply
Replication by itself is not enough. You need pre-registration too. Otherwise you can p-hack the replications.
[+] tyrankh|8 years ago|reply
I'm not trying to be facetious, but isn't this something you learn in junior-level stats? I had this drilled into me in both undergrad math courses and grad machine learning courses; I'm confused to see it warrant an article.
[+] pmyteh|8 years ago|reply
It's well known what p-values show. But they are, in practice, used as a gatekeeping mechanism in academic journals in many fields (including mine). Worse, getting p<0.05 is informally seen as a measure of practical significance, rather than simply as one statistical test among many.

So yes, it is something you learn in introductory quantitative methods classes. But I don't think most researchers understand just how much it matters.

Also, a key R package for producing regression tables of coefficients for journal articles is called 'stargazer'. Given the unwarranted focus of many readers on those indicia of 'significant' results, I think it's well named.

I currently have the opposite problem. Given that I work with very large online datasets (N=1M or so), everything, including the random noise, is statistically significant at p<0.05. It really is effect sizes or bust at that point.
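A quick calculation illustrating the large-N point (the difference, spread, and sample size below are invented numbers):

```python
# With a million samples per group, even a negligible difference in means
# clears p < 0.05.
from math import sqrt, erf

def two_sample_z_p(diff, sd, n):
    z = diff / (sd * sqrt(2 / n))            # two-sample z statistic
    phi = 0.5 * (1 + erf(abs(z) / sqrt(2)))  # standard normal CDF
    return 2 * (1 - phi)                     # two-sided p-value

p = two_sample_z_p(diff=0.005, sd=1.0, n=1_000_000)
print(p)  # well below 0.05, despite an effect size (Cohen's d) of only 0.005
```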

[+] whatshisface|8 years ago|reply
The harsh reality is that most scientists are not sifting through every statistics book they can get their hands on in order to find out all the reasons they might be wrong. The "individual motivation" to become statistics experts is only present in a few fields; in the others, statistics is offloaded into applied courses taught by other departments.

Statistics is directly necessary in ML, so it's a "profit center" and emphasized. In many sciences it's treated like a cost center (something that you need, like IT, but that lies outside of your central expertise.)

[+] danieltillett|8 years ago|reply
To misquote Upton Sinclair you can’t get a scientist to understand statistics when their job depends on misunderstanding statistics.

The basic problem is under the current funding environment it is far better to pump out a dozen wrong papers than one carefully researched paper.

[+] rossdavidh|8 years ago|reply
The literal answer to your question is "no, generally not". That a greater emphasis on statistics should be included in science is certainly the case, but then there is a school of thought that knowing a little bit of frequentist statistics is better than knowing none at all. Regardless, I am fairly confident that most scientists (or engineers) do not actually learn this as juniors (or seniors, or Ph.D.s).
[+] Fomite|8 years ago|reply
You can end up with a Ph.D. in some fields having been exposed to almost no statistics, or only statistics which work in very confined settings (certain experimental sciences where "just do an ANOVA" is genuinely the answer to almost every question).

That often works...right up until the moment when a scientist has to step outside that context.

This often cuts both ways though. I have seen beautiful math and statistics around problems that don't make any sense if you've taken more than one semester of microbiology.

[+] amluto|8 years ago|reply
The article says:

> Note: this has nothing to do with p-hacking (which is a huge but separate issue).

I disagree. p-hacking is when one experimenter checks many statistical tests to find one that is significant. The effect the author is discussing is that many experimenters do many experiments and the significant ones get published. One is more unethical (or maybe just incompetent) than the other, but they’re essentially the same phenomenon.

[+] anbende|8 years ago|reply
They are the same in that they both create a situation in which the p-value cannot be trusted. However, in one case this is deliberate. In the other it's a problem with the whole enterprise.

Also, running multiple tests without correcting for multiple testing (usually by reducing the threshold for significance) is just one form of p-hacking. The more insidious version is when one runs the test after every few participants until random chance makes it "slip over the edge of significance". In that case there might not even be enough variables for multiple testing to have occurred, and it becomes very difficult to detect.
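The optional-stopping behaviour described above is easy to simulate (the peeking schedule and sample sizes below are made up for illustration):

```python
# Peek after every few participants and stop at the first "significant"
# result. The null is true throughout, yet the false-positive rate lands
# well above the nominal 5%.
import random
from statistics import NormalDist

random.seed(1)
Z = NormalDist()

def peeking_trial(max_n=200, peek_every=10, alpha=0.05):
    data = []
    for i in range(1, max_n + 1):
        data.append(random.gauss(0, 1))  # the true mean really is 0
        if i >= 20 and i % peek_every == 0:
            mean = sum(data) / i
            sd = (sum((x - mean) ** 2 for x in data) / (i - 1)) ** 0.5
            z = mean / (sd / i ** 0.5)
            if 2 * (1 - Z.cdf(abs(z))) < alpha:
                return True  # stop early and report "significance"
    return False

trials = 2000
false_positive_rate = sum(peeking_trial() for _ in range(trials)) / trials
print(false_positive_rate)  # well above 0.05
```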

[+] Fomite|8 years ago|reply
The difference between something that requires a bad actor and one which is an outcome of a system working as intended is pretty huge.
[+] Malarkey73|8 years ago|reply
I'm honestly more tired of essays about p-values than p-values.

It's true that, like all metrics, if it becomes a target then it may be abused (Goodhart's law).

However, if you abolished p-values, people would start hacking or misunderstanding priors or confidence limits or odds ratios instead.

It's an easy, dumb stat that almost anyone can do in Excel and almost everyone recognises. The emphasis should be that it remains a quick shorthand for casual use, but that more complex studies have more sophisticated models and probabilistic reasoning.

But the emphasis on p-values is bizarre. As best illustrated by JT Leek, the pipeline of data research has multiple points of failure that may lead to false findings or irreproducible research. Yet we talk very little about them, whilst essays about p-values come out every week...

https://www.nature.com/news/statistics-p-values-are-just-the...

[+] aaavl2821|8 years ago|reply
This was a really interesting article. I've worked with researchers who try to defend a small but statistically significant finding that just doesn't seem likely to be real, and this provides a statistical explanation for my skepticism. The p-value mentality is deeply ingrained in a lot of researchers, though.

The challenge for journal editors seems very real. There's another group that deals with this challenge of interpreting the validity of significant findings for a living, though: biotech VCs. A lot of the time, trying to reproduce the work is their best way of addressing it, and often the first work done by startups is to try to replicate the academic results. For some other heuristics VCs use to assess "reproducibility risk", see here:

https://lifescivc.com/2012/09/scientific-reproducibility-beg...

[+] wmnwmn|8 years ago|reply
Two solutions: a) stop doing experiments that just look for correlation without any attempt to get at mechanism. Of course sometimes you can't avoid this, and then b) use lower p-value thresholds. Don't waste thousands or millions (more) of dollars following up 5% results.
[+] jssmith|8 years ago|reply
> Many researchers are now arguing that we should, more generally, move away from using statistics to make all-or-none decisions and instead use them for "estimation". In other words, instead of asking whether an effect is null or not, we should ask how big the effect is likely to be given the data.

I couldn't agree more with this statement, and even more so in a business setting than in research. It's just so easy to get caught up in statistical significance and lose perspective on practical significance. I've found confidence intervals the most informative and easiest to understand.
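A minimal sketch of reporting an estimate with a confidence interval rather than a bare p-value (the effect estimate and standard error are made-up numbers):

```python
# Report the effect size with a 95% confidence interval.
from statistics import NormalDist

z = NormalDist().inv_cdf(0.975)   # ~1.96 for a 95% interval
effect, se = 0.4, 0.15            # estimated effect and its standard error (assumed)
lo, hi = effect - z * se, effect + z * se
print(f"effect = {effect:.2f}, 95% CI [{lo:.2f}, {hi:.2f}]")
# The interval conveys both the size of the effect and the uncertainty around it.
```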

[+] learnstats2|8 years ago|reply
When I was first taught statistics, I was told that the researcher had to justify a plausible hypothesis first - and then do a hypothesis test/p-value to prove their theory.

If this combination of the scientist's intuitive understanding and the p-value test result align, then this is a credible result.

On the other hand, the trend now is to conduct every possible test whether or not there is any justification for doing so (corrected for multiple testing, no p-hacking, yes, sure)

For example, in tech, we might test every shade of blue. Some of those blues are gonna come up as p-value hits - but since we had no good reason to do this test, this was probably just random noise.

Similarly, in genetics, we're gonna test every single gene against everything - just to see what happens (yes, yes, do a Bonferroni correction on each set of tests). Hmm, recent results in genetics don't seem to be very robust or repeatable, for some reason.

The likelihood of a truthful link in these tests is incredibly low. When we have no particular reason to believe there is a truthful link and are just blind testing, the false positive rate is very high (as described in the article), and probably even higher than the article speculates - almost all hits are gonna be false positives.
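A back-of-envelope version of this claim, assuming a conventional alpha of 0.05 and power of 0.8 (both invented for illustration):

```python
# Positive predictive value: what fraction of "significant" hits are real,
# given the prior probability that a tested link is true?
def ppv(prior, alpha=0.05, power=0.8):
    true_hits = power * prior
    false_hits = alpha * (1 - prior)
    return true_hits / (true_hits + false_hits)

print(ppv(prior=0.001))  # ~0.016: under blind testing, ~98% of hits are noise
print(ppv(prior=0.5))    # ~0.94: a well-justified hypothesis fares far better
```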

Maybe p-values just don't work well with modern day data. Or, maybe, Big Data just doesn't contain information about mysterious, unexplored, and innovative correlations that we hope it does.

[+] teej|8 years ago|reply
“On the other hand, the trend now is to conduct every possible test whether or not there is any justification for doing so (corrected for multiple testing, no p-hacking, yes, sure)”

You are literally describing p-hacking.

[+] nonbel|8 years ago|reply
It isn't so much that there is no "truthful link", it is that everything is linked to everything else to some degree and the mathematical models they use for the null hypothesis are just "defaults". These assumptions are almost always violated. The statistical tests detect that, and are providing "true positives".
[+] kolpa|8 years ago|reply
If you run an experiment twice and the same shade of blue wins both times, that should persuade you that the winning shade is better. And if you keep replicating the experiment and it keeps winning, that should increase confidence further.
[+] platz|8 years ago|reply
Yes - modern-day data is not obtained from controlled clinical trials.
[+] piotrkaminski|8 years ago|reply
> instead of asking whether an effect is null or not, we should ask how big the effect is likely to be given the data. However, at the end of the day, editors need to make an all-or-none decision about whether to publish a paper

Yet another way in which the traditional publishing structure actively harms science.

[+] Bromskloss|8 years ago|reply
Do you have an alternative way of publishing in mind?
[+] alan-crowe|8 years ago|reply
I remember reading http://andrewgelman.com/2016/11/13/more-on-my-paper-with-joh...

with its graph "This is what power = 0.06 looks like". So I got the point that you have to have sufficient statistical power. A useful rule of thumb is that you need a power of at least 0.8. You need to have some idea how big the effect is likely to be - perhaps from previous exploratory research, from claims of other researchers, or from reasoning "well, if this is happening the way we think it is, there has to be an effect greater than x waiting to be discovered." Then you work out how big a sample size you need to use. Then you roll up your sleeves and get down to work.

But the reason for using p values rather than Bayesian inference is that it gets you out of the tricky problem of coming up with a prior. You only need to think about the null hypothesis and ask yourself whether the probability of the data, given the null hypothesis, is less than 0.05.

So there is a bit of a contradiction. p values don't really work unless you ensure that you have sufficient power. To do that you need a plausible effect size to feed into your power calculation. And that is implicitly a rough approximate prior: 50:50, either null or that effect. You could just do a Bayesian update, stating how much you shifted from 50:50.

Basically, if you don't already know enough to have an arguable prior to get a Bayesian approach started, you don't know enough to do a power calculation, so you shouldn't be using p-values either.
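A sketch of the power calculation being described, using the normal approximation (a real design would typically use a t-based calculation, which gives slightly larger samples):

```python
# Sample size per group for a two-sample comparison at a given effect size
# (Cohen's d), significance level, and power.
from statistics import NormalDist

def n_per_group(effect_size, alpha=0.05, power=0.8):
    z = NormalDist().inv_cdf
    z_alpha = z(1 - alpha / 2)  # ~1.96 for a two-sided test at alpha = 0.05
    z_beta = z(power)           # ~0.84 for 80% power
    return 2 * ((z_alpha + z_beta) / effect_size) ** 2

print(n_per_group(0.5))  # ~63 per group to detect a "medium" effect of d = 0.5
```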

I went looking on andrewgelman.com for a reference for wanting power = 0.8 and found a more recent post

http://andrewgelman.com/2017/12/04/80-power-lie/

Oh shit! The situation is much worse than I realised :-(

[+] thousandautumns|8 years ago|reply
> But the reason for using p values rather than Bayesian inference is that it gets you out of the tricky problem of coming up with a prior.

It technically doesn't even do this. Using a frequentist approach is equivalent to a Bayesian approach with an uninformative prior, which is itself an assumption baked into the analysis - only one that is almost unquestionably incorrect. It's essentially saying you have literally no idea how the data are being generated, which is certainly not true.

[+] thanatropism|8 years ago|reply
This is an old thread already and I don't know if I'm getting my voice heard. But at any rate: hypothesis testing (slightly different philosophically from p-values, but anyway) is bogus because conjectures-and-refutations falsificationism is bogus. That's not how good science has ever happened, only how bogus research programs have dressed themselves in science.

The core of science is "the unity of science". Signal-to-noise measurements tell you very little outside a general coherentist/holistic verificationist framework.