
P values are not as reliable as many scientists assume

185 points | feelthepain | 12 years ago | nature.com | reply

128 comments

[+] Blahah|12 years ago|reply
This is probably the most annoying problem in my daily life (yeah I know, first-world problems). I have daily conversations with biologists where I've analysed some data and associated (say) a posterior probability with each condition in the model. They insist I give them p-values or something they can present as though they were p-values. At the beginning of my PhD I complied. Then they throw out all the nuance in the data, put one asterisk for p < 0.05, two asterisks for p < 0.01, etc., and (this is the horrifying part) _believe_ that an asterisk indicates that something is true. They put stupid asterisks all over my beautiful plots and then think their arbitrary cutoffs mean something biologically meaningful. I die a little inside every time I see an asterisk on a plot.

Now I refuse to use p-values and deliberately construct analyses that are incompatible with Fisherian statistics. And rather than giving people raw numbers, I produce a massive document of interpretation. Takes a huge amount of time, but I'm hoping it will mean my publishing track record will contain significantly (ha!) fewer false results than most biologists'.

[+] gabemart|12 years ago|reply
I was recently trying to explain Bayesian logic to a friend, and came up with the following analogy. I would be interested to hear feedback on it.

---

Imagine everyone in the USA gets sudden amnesia. We want to find out who the President is, but no one can remember.

A scientist comes up with a test to determine if someone is the President.

If they are the President, there is a 100% chance the test will say they are the President and a 0% chance the test will say they are not the President.

If they are not the President, there is a 99.999% chance the test will say they are not the President, and a 0.001% chance the test will falsely say they are the President.

Giving the test to the person sitting in the big chair in the Oval Office is useful, because it's already quite likely this person is the President. If the test is positive for Presidency, it's extremely likely that person is the president.

Giving the test to the 10 people nearest the Oval Office is useful, because it's fairly likely the President is one of these people. A positive result will indicate strongly that that person is the President, and if no one in that group is actually the President, there's a 99.99% chance the test will say so.

Giving the test to the 1000 people in the White House is pretty useful, because it's pretty likely the President is in the White House, and if none of these people are the President, there's still a 99% chance the test will be correct. A positive result for any one person will indicate quite strongly that that person is the President.

But giving the test to everyone in America is not very useful at all, because it's very unlikely that any particular person is the President, and we can expect the test will give a positive result for around 3200 people. For any particular person in this group, it's much more likely they're not the President than they are.

---

Is this a broadly correct, if non-rigorous, analogy? I realize most HNers will be much more familiar with this stuff than I am; I'm interested chiefly in whether or not I misled my friend.
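
A rough sketch of the arithmetic behind the analogy (my own illustration, assuming the test rates above and a US population of roughly 320 million):

    # Bayes' rule applied to the President-test analogy.
    # Assumed numbers (not from the comment itself): sensitivity = 100%,
    # false-positive rate = 0.001% = 1e-5, US population ~320 million.
    def posterior(prior, sensitivity=1.0, false_positive=1e-5):
        """P(President | test says President), given the prior P(President)."""
        p_positive = sensitivity * prior + false_positive * (1 - prior)
        return sensitivity * prior / p_positive

    scenarios = [
        ("person in the big chair", 0.5),                 # generous guess at the prior
        ("one of 10 people near the Oval Office", 1 / 10),
        ("one of 1000 people in the White House", 1 / 1000),
        ("a random American", 1 / 320_000_000),
    ]
    for label, prior in scenarios:
        print(f"{label:40s} posterior = {posterior(prior):.6f}")

    # Expected false positives when testing everyone in America:
    print(320_000_000 * 1e-5)   # about 3200 people, as in the analogy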

[+] klodolph|12 years ago|reply
Right. And the frequentist version is also useful.

If you test one person and the test is positive, then that person is the President (p=0.00001).

If you test a thousand people, and the test is positive for one of them, then that person is the President (p=0.01).

So you don't really need Bayesian logic to reason that you should test fewer people if you want a more significant result. (Note I'm not saying you don't need Bayes' Theorem, which everyone uses.)
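
A sketch of that frequentist arithmetic (assuming the 0.001% false-positive rate from the analogy): the chance of at least one false positive grows with the number of non-Presidents tested.

    # Chance of at least one false positive among n tests of non-Presidents,
    # assuming the analogy's 0.001% (1e-5) false-positive rate.
    # n = 1 gives 1e-5 and n = 1000 gives ~0.01, matching the p-values above.
    fp = 1e-5
    for n in (1, 1_000, 320_000_000):
        print(f"n = {n:>11,d}:  P(at least one false positive) = {1 - (1 - fp) ** n:.5g}")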

Edit: I think most people on HN get their knowledge of frequentist and Bayesian statistics from XKCD #1132. That's sad.

[+] pcrh|12 years ago|reply
A big problem with Bayesian statistics (as I see it) is that it is not always possible to have any sense of what the prior probability is.

Say you are looking for genes that might influence the rate of occurrence of a particular disease. There might be genes that influence this rate, or there might not; it could be entirely environmental, or it could be entirely genetic, or something in between. In any case, you do genome-wide studies, and find that certain gene variants occur more often in your diseased population than in your control population. You apply frequentist statistics, using some corrections for multiple hypothesis testing, and get some kind of "significant" result. This gets published in Nature (you lucky thing!).

Are your conclusions correct? Do the genes you identified really modify the course of the disease you studied? Bayesian statistics won't give you the answer.

The only way to get the answer is to do experimental science, i.e. deliberately modify the gene(s) in question and show that your modifications change the occurrence or course of the disease.

Unfortunately, that is not always feasible, for either technical or ethical reasons, so we have to fall back on the poor cousin of experimental science that is population statistics.

[+] doctorpangloss|12 years ago|reply
>A scientist comes up with a test to determine if someone is the President.

It's a poor analogy, because it's not clear to people why such a test is "natural." It's not clear how your specific test could be broken in the peculiar way that it would have a 99.999% chance of confirming that someone who isn't the President is indeed not the President.

And people would get caught up on what you mean by "If they are" and "If they are not," since it's not clear how you would know the error rate of your test without a real President around to identify.

False positives or false negatives are not at all intuitive to people who have never done experimental design. Most people would get stuck at percentages anyhow.

[+] joe_the_user|12 years ago|reply
OK,

So just to get a handle on this stuff, the two problems you have if you do a test and only look at a low p are:

1) Unusual things do occur. If a million people do the same test, it's obvious they'll come up with some wrong values. It's less obvious that a similar number of wrong values will come if a million people do a million different tests with a similar small chance of bogus results.

2) The pattern of results may indeed be unusual, but not necessarily in the way you think. The data may contain a genuinely non-random pattern that has nothing to do with your particular hypothesis, yet a "this is not random" result may seem to say your hypothesis does explain the data.

Does that characterize the problem?

[+] tlarkworthy|12 years ago|reply
yeah, it's similar to a classic rare-disease screening example in textbooks:

http://www.math.hmc.edu/funfacts/ffiles/30002.6.shtml (sorry it's a bit informal, it was the first Google hit, but I have seen it in real textbooks for sure)

Although it's a bit confusing associating the prior with geometric proximity to the Oval Office chair.

[+] jawns|12 years ago|reply
This is the reason why I don't include P values on http://www.correlated.org.

They would muddle my otherwise irreproachable statistics.

[+] klodolph|12 years ago|reply
I bet the p-values on that site would be very high if properly calculated. Generally, if you are reporting correlations between a large number of variables, the p-values shoot through the roof.

Of course, TONS of people forget this and publish a p-value as if those two variables are the only ones under consideration. Which is just sad.
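
A quick simulation of that multiple-comparisons effect (my own sketch; all variables are pure noise, so every "significant" correlation is spurious):

    # Correlate many independent noise variables pairwise and count how many
    # pairs come out "significant" at p < 0.05 purely by chance.
    import itertools
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    n_vars, n_obs = 40, 100
    data = rng.normal(size=(n_vars, n_obs))

    pairs = list(itertools.combinations(range(n_vars), 2))   # 780 pairs
    pvals = [stats.pearsonr(data[i], data[j])[1] for i, j in pairs]

    hits = sum(p < 0.05 for p in pvals)
    print(f"{hits} of {len(pairs)} noise correlations have p < 0.05 "
          f"(roughly {0.05 * len(pairs):.0f} expected by chance)")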

[+] mbateman|12 years ago|reply
Holy crap this site is fantastic. Thanks.
[+] moontear|12 years ago|reply
Woah, why didn't I know about this site? Awesome! Would be great if you'd include links to your original sources. Even though you don't include P-values in the graphics, I would still love to see them on the detail page, along with the chi-square values. Your tagline should be "correlation does not imply causation" ;-)
[+] relaunched|12 years ago|reply
The last sentence in this paragraph is hilarious:

P values have always had critics. In their almost nine decades of existence, they have been likened to mosquitoes (annoying and impossible to swat away), the emperor's new clothes (fraught with obvious problems that everyone ignores) and the tool of a "sterile intellectual rake" who ravishes science but leaves it with no progeny. One researcher suggested rechristening the methodology "statistical hypothesis inference testing", presumably for the acronym it would yield.

[+] snowwrestler|12 years ago|reply
Statistics are descriptive, not predictive--period.

I'm continually surprised at how many people either don't know, or don't internalize, that. Look at how often "risk factors"--which are a descriptive concept--are converted to advice--which is predictive.

Doing so in the absence of a causal hypothesis is a basic violation of "correlation does not equal causation."

If you want to construct a scientific theory you must be able to articulate some predictive tests, and that means you must hypothesize a causal mechanism.

[+] jules|12 years ago|reply
Yet the only reason we are interested in statistics is for the predictive aspect. This is exactly where hypothesis testing goes wrong: as much as you may claim that you're doing something purely descriptive, the whole point of the exercise is to make decisions. Hypothesis tests simply don't give us the right information to base a decision on, but in practice people still do make decisions based on them in a fundamentally incorrect way.
[+] cscheid|12 years ago|reply
That's because your short statement is not really representative of good statistical practice. For example, people spend a lot of time researching http://en.wikipedia.org/wiki/Generalization_error, and models like http://en.wikipedia.org/wiki/Probably_approximately_correct_... worry a lot about things like VC dimension exactly because they characterize the behavior of statistical models to unseen data. Or maybe you don't think of prediction as "behavior under unseen data"?
[+] mtdewcmu|12 years ago|reply
True-- Correlation does not imply causation. But-- Correlation implies publication.
[+] cschmidt|12 years ago|reply
As in "I used to think correlation implied causation. Then I took a statistics class. Now I don't..."

http://xkcd.com/552/

[+] dllthomas|12 years ago|reply
Reliable correlation is all you need to make predictions. If you understand causation, you can be more confident about the reliability (or lack thereof) of the correlation under changing conditions.
[+] baddox|12 years ago|reply
> Statistics are descriptive, not predictive--period.

Does the law of large numbers somewhat unify descriptiveness and predictiveness?

[+] shas3|12 years ago|reply
The root of this problem is that most data sets in psychology, anthropology, and epidemiology are not as large in terms of sample size as what computer scientists and electrical engineers encounter. p-values are a surrogate for explicitly describing the data using probability distributions or as random processes. In essence, you sacrifice granularity for simplicity. If you look at the original works of Fisher, etc. and their widespread utility, a large part of early statistics is intended for 'practical statisticians' who seldom encounter data sets that are large in terms of sample size. As someone who works in electrical engineering/computer science, I've never used the p-value because:

1. The field, in general, demands far more mathematical rigor when dealing with statistics.

2. The demand for mathematical rigor is justified because most data sets we deal with are many orders of magnitude larger than what psychologists and others encounter. So predictions based on limit theorems, etc. are often testable.

[+] JoshTriplett|12 years ago|reply
I'd love to see a comprehensive article that shows what a research paper's analysis would look like using Bayesian methods. I've seen plenty of general hints about Bayesian methods, discussion of priors, and similar, but I haven't found any specific guide on how to apply those methods to the types of research papers that would traditionally use a null hypothesis significance test with a p value.
[+] Homunculiheaded|12 years ago|reply
Not articles, but there are two excellent books on the subject that I can't recommend enough:

If you read calculus with about the same fluency as comic books, then "Data Analysis: A Bayesian Tutorial" is awesome http://www.amazon.com/Data-Analysis-A-Bayesian-Tutorial/dp/0...

And if you would like a little more exposition (but still a mathematically sophisticated treatment) "Doing Bayesian Data Analysis: A Tutorial with R and BUGS" is fantastic http://www.amazon.com/Doing-Bayesian-Data-Analysis-Tutorial/...

The latter will also give you more details of how to approach classical, frequentist tests and summary statistics with their Bayesian equivalents.

Honestly I would say get both books as they're cheap and provide different insights. You only need to read a few chapters of each to see how you approach basic experiments from a Bayesian perspective.

[+] Fomite|12 years ago|reply
There are a number of tutorials scattered throughout the International Journal of Epidemiology, Epidemiology, and the American Journal of Epidemiology, as well as good "exemplar" articles either using Bayesian methods, or using both approaches.
[+] tokenadult|12 years ago|reply
As the article reports, "Perhaps the worst fallacy is the kind of self-deception for which psychologist Uri Simonsohn of the University of Pennsylvania and his colleagues have popularized the term P-hacking; it is also known as data-dredging, snooping, fishing, significance-chasing and double-dipping. 'P-hacking,' says Simonsohn, 'is trying multiple things until you get the desired result' — even unconsciously."

Simonsohn has a whole website about "p-hacking" and how to detect it.

http://www.p-curve.com/

He and his colleagues are concerned about making scientific papers more reliable. You can use the p-curve software on that site for your own investigations into p values found in published research.

Many of the interesting issues brought up by the comments on the article kindly submitted here become much clearer after reading Simonsohn's various articles

http://opim.wharton.upenn.edu/~uws/

about p values and what they mean, and other aspects of interpreting published scientific research. He also has a paper

http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2259879

on evaluating replication results with more specific tips on that issue.

Abstract: "When does a replication attempt fail? The most common standard is: when it obtains p>.05. I begin here by evaluating this standard in the context of three published replication attempts, involving investigations of the embodiment of morality, the endowment effect, and weather effects on life satisfaction, concluding the standard has unacceptable problems. I then describe similarly unacceptable problems associated with standards that rely on effect-size comparisons between original and replication results. Finally, I propose a new standard: Replication attempts fail when their results indicate that the effect, if it exists at all, is too small to have been detected by the original study. This new standard (1) circumvents the problems associated with existing standards, (2) arrives at intuitively compelling interpretations of existing replication results, and (3) suggests a simple sample size requirement for replication attempts: 2.5 times the original sample."

[+] analog31|12 years ago|reply
"If your experiment needs statistics, you ought to have done a better experiment." -- Ernest Rutherford
[+] yread|12 years ago|reply
I don't understand how they got to the number 71% for a 0.05 p-value.

A 0.05 p-value means that there is a 5% probability that (for a t-test, as an example) a difference in the averages of two sequences (the statistic) is by chance and not because of a difference in the means of their underlying normal distributions.

I assume that the "toss-up" means that there is no difference in the means in reality (so the null hypothesis is true). Am I understanding it correctly? Shouldn't the probability of getting a p-value < 0.05 in this case be less than 5%, not 29%?

How did they get the 29%?
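
For what it's worth, one plausible source of the figure (an assumption on my part, based on the minimum-Bayes-factor bound of Sellke, Bayarri and Berger that writers on this topic often use): with 1:1 prior odds of a real effect ("toss-up"), a result at exactly p = 0.05 still leaves at least a 29% chance that the null is true.

    # Sketch of one way to reproduce the article's "at least 29%" figure.
    # Assumption: it uses the minimum Bayes factor bound -e * p * ln(p)
    # (Sellke/Bayarri/Berger) with 1:1 prior odds of a real effect.
    import math

    def min_false_alarm_prob(p, prior_odds_real_effect=1.0):
        """Lower bound on P(null is true | a result at this p-value)."""
        bf_null = -math.e * p * math.log(p)      # best case for the alternative
        posterior_odds_null = bf_null / prior_odds_real_effect
        return posterior_odds_null / (1 + posterior_odds_null)

    for p in (0.05, 0.01):
        print(f"p = {p}: false-alarm probability >= {min_false_alarm_prob(p):.0%}")
    # prints roughly 29% for p = 0.05 and 11% for p = 0.01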

[+] yetanotherphd|12 years ago|reply
Most of the objections here and in the article are not inherent problems with frequentist p-values.

First, the reported p-value might be wrong. E.g. basing it on assumptions of normality when the data is non-normal. However modern non-parametric approaches like the bootstrap can avoid this issue.
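
A minimal sketch of the bootstrap idea, with made-up and deliberately non-normal data (illustrative only): a percentile-bootstrap confidence interval for a difference in means.

    # Percentile bootstrap: a confidence interval for a difference in means
    # without assuming normality. The data here are made up for illustration.
    import numpy as np

    rng = np.random.default_rng(42)
    group_a = rng.exponential(scale=1.0, size=30)   # skewed, non-normal samples
    group_b = rng.exponential(scale=1.5, size=30)
    observed_diff = group_b.mean() - group_a.mean()

    boot_diffs = np.empty(10_000)
    for i in range(boot_diffs.size):
        resample_a = rng.choice(group_a, size=group_a.size, replace=True)
        resample_b = rng.choice(group_b, size=group_b.size, replace=True)
        boot_diffs[i] = resample_b.mean() - resample_a.mean()

    lo, hi = np.percentile(boot_diffs, [2.5, 97.5])
    print(f"difference in means = {observed_diff:.2f}, "
          f"95% bootstrap CI = ({lo:.2f}, {hi:.2f})")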

Second, testing multiple hypotheses. If you test 10 hypotheses then you cannot reject the null (that all 10 null hypotheses hold) simply because one single hypothesis is rejected in isolation. But this is well known, and failing to account for it is an issue with the researcher, not with frequentist statistics. I actually think that the main practical difference between Bayesian and Frequentist statistics is whether accounting for the issue of multiple hypotheses is done formally or informally.

[+] hootener|12 years ago|reply
The article doesn't bash the p-value as a statistical test specifically; it's more about its use and interpretation by scientists over the years.

You're absolutely correct about using non-parametric tests, and more scientists should be using them. The normality assumption is flat out laughable when using real-world data most of the time.

You're also correct about multiple hypothesis testing. Accounting for family-wise error (e.g., Holm adjustments) can help to keep your p-value reporting honest.
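
A small sketch of the Holm step-down adjustment mentioned above, applied to a handful of made-up p-values, just to show the mechanics:

    # Holm step-down adjustment for a family of p-values (made-up values).
    # Controls the family-wise error rate; compare adjusted values to 0.05.
    def holm_adjust(pvalues):
        m = len(pvalues)
        order = sorted(range(m), key=lambda i: pvalues[i])
        adjusted = [0.0] * m
        running_max = 0.0
        for rank, i in enumerate(order):
            running_max = max(running_max, (m - rank) * pvalues[i])
            adjusted[i] = min(1.0, running_max)
        return adjusted

    raw = [0.003, 0.012, 0.021, 0.040, 0.300]
    print(list(zip(raw, [round(p, 3) for p in holm_adjust(raw)])))
    # raw p = 0.040 looks "significant" alone, but its adjusted value is 0.08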

That doesn't negate the underlying problem, though. A p-value is simply an indication, nothing more. The p-value never promised to be more than that. The issue isn't in the p-value's construction; the issue lies in its misuse and how easily it can be abused in statistical reporting (see: p-hacking).

The p-value as a test statistic is perfectly honest in my opinion. But like many other statistical methods, it comes with its own set of baggage that I feel gets conveniently glossed over more often than it should.

[+] mandor|12 years ago|reply
I fully agree with the criticisms of p-values, but what are the best alternatives for analyzing and comparing data? Most of the time, scientists have to compare the outcome of treatment 1 versus treatment 2; how should they do it "properly"?

What is the HN recommendation?

[+] Fomite|12 years ago|reply
Effect measures. Don't just report your p-value, report the actual effect measure, and a measure of uncertainty around it, be it a frequentist confidence interval, Bayesian posterior distribution, etc.

More information is better.
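
As a concrete sketch of that kind of report (made-up data, and a plain t-based interval just for illustration):

    # Report an effect estimate with uncertainty instead of a bare p-value:
    # here, a difference in means with a 95% confidence interval (made-up data).
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(7)
    treatment_1 = rng.normal(loc=10.0, scale=2.0, size=40)
    treatment_2 = rng.normal(loc=11.0, scale=2.0, size=40)

    diff = treatment_2.mean() - treatment_1.mean()
    se = np.sqrt(treatment_1.var(ddof=1) / treatment_1.size +
                 treatment_2.var(ddof=1) / treatment_2.size)
    df = treatment_1.size + treatment_2.size - 2      # simple approximation
    lo, hi = stats.t.interval(0.95, df, loc=diff, scale=se)

    print(f"effect (difference in means) = {diff:.2f}, 95% CI = ({lo:.2f}, {hi:.2f})")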

[+] socrates1998|12 years ago|reply
I have always thought people rely on the normal distribution too much.

Does it work? Sometimes.

The problem is that people tend to believe something they use a lot.

Even the 0.05 threshold is sort of made up.

Correlation does not mean causation.

[+] jheriko|12 years ago|reply
p-values need not arise from the normal distribution - they can come from any distribution - and selecting the right one is another source of error when producing them.

the normal distribution is also quite well justified by the central limit theorem.

i do however agree that a p-value of 0.05 is not worth very much.

[+] Finster|12 years ago|reply
> In 2005, epidemiologist John Ioannidis of Stanford University in California suggested that most published findings are false; since then, a string of high-profile replication problems has forced scientists to rethink how they evaluate results.

That's what is supposed to happen, though, right? You publish your findings. Others try to reproduce. They publish THEIR findings, etc. etc. If most published findings are false, it sounds like the process is working as designed.

[+] pessimizer|12 years ago|reply
Bad papers can be generated, published, and cited a lot faster than failures to replicate.
[+] loderunner|12 years ago|reply
Another reason to be skeptical of the statistics thrown around in popular news.
[+] milliams|12 years ago|reply
This is why, to be on the safe side, in particle physics we have a requirement of a p-value of 0.0000003 for a discovery.
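
That threshold is the usual "five sigma" convention; a quick check of the correspondence (assuming a one-sided normal tail):

    # The particle-physics discovery threshold as a one-sided normal tail.
    from scipy import stats
    print(stats.norm.isf(3e-7))   # ~4.99: a p-value of 3e-7 is about five sigma
    print(stats.norm.sf(5.0))     # ~2.9e-7: the tail probability beyond 5 sigma
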
[+] stinos|12 years ago|reply
But isn't the whole point that no matter how low the P-value is, it is not a reliable measure?
[+] mrcactu5|12 years ago|reply
I routinely get jobs from doctors at prestigious universities who say, "here's a study with 3 samples, see if we can get p < 0.05"
[+] jheriko|12 years ago|reply
a p-value of 0.05 or even 0.01 is stupidly high. it only takes a little thought experiment about what that means in reality to realise how permissive it is, and you can find demonstrations of this without going particularly far, looking very hard or being especially well educated...

consider the wikipedia example with heads vs. tails.

http://en.wikipedia.org/wiki/P-value#Examples

the idea that 5 coin tosses can produce a p-value < 0.05 that 'demonstrates' that the coin is biased towards heads is intuitively 'obviously wrong'. even if we take it to 10 coin tosses (the p-value you get is 0.001 - which looks really strong if we accept that 0.01 is acceptable) it clashes with my own ideals for what statistical significance should mean. this is in a loose way a proof by contradiction that p-values of 0.05 or 0.01 do not have utility (at least for these kinds of small n).

aside from that consider running the experiment 5 times or 20 times. how many false positives do you expect? what is the expected number of false positives? is that significant?

it also bothers me how connected the value itself is to the problem formulation. if we analyse the same situation with an identical test but a different formulation of the problem, the values differ.

why is five heads in a row less significant as a result when the test is whether a coin is biased at all rather than a test that it is biased towards heads only? sure i understand the probability involved there that we have all these potential coins biased towards tails that mean nothing in the first case - but there is something very deeply wrong with that.

shouldn't this be the other way around? if 5 consecutive heads is good evidence that a coin is biased towards heads, isn't it equally good evidence that it is biased at all? classical logic says that it is because being biased towards heads is a subset of being biased in either direction. the truth is that it really is equally good evidence - i challenge someone to explain why it is not! ( actually i kinda want to be wrong about that because i might learn something new then :) )
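
A plain binomial calculation (my own sketch, nothing fancier) makes the one-sided versus two-sided difference concrete:

    # One-sided vs two-sided p-values for 5 heads in 5 tosses of a fair coin.
    # Same data; only the question being asked changes.
    p_one_sided = 0.5 ** 5        # testing "biased towards heads": only 5 heads counts
    p_two_sided = 2 * 0.5 ** 5    # "biased either way": 5 tails is equally extreme

    print(f"one-sided: p = {p_one_sided:.4f}  (below 0.05)")
    print(f"two-sided: p = {p_two_sided:.4f}  (above 0.05)")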

probability is counter-intuitive and useless for the kinds of small n usually used in experiments - the intuition about it recovers when we deal with sensible n - numbers like 1000 or 10000 - but these are still small n really if you need to scale up, or be confident that your result is correct. even at 100 samples it's obvious that our idealisation of percentage and what happens in reality do not marry up neatly...

to make a very crude software analogy, what about those 1 in 10,000 bugs? they are still a very real problem if you have millions of customers...

or - IMO even 10,000 is an exceedingly small n to try and draw robust conclusions from.

[+] sp332|12 years ago|reply
0.05 == 5% == 1/20. If you flip a coin 5 times and get heads every time, do you intuitively feel that there is more than 1-in-20 odds that the coin is fair?

You should really get used to the idea that stating a different problem will give you a different answer. You need to be very careful when asking a question, or your answer might not mean what you think it means.

[+] allochthon|12 years ago|reply
In non-technical terms, ten heads in a row is sufficient to start wondering about a coin. Try a coin flipping experiment and see if you agree. :)