The article submitted here links to the American Statistical Association's statement on the meaning of p-values,[1] the first such methodological statement the association has ever formally issued. It's free to read and download. The statement boils down to these main points, with further explanation in the text of the statement.
"What is a p-value?
"Informally, a p-value is the probability under a specified statistical model that a statistical summary of the data (for example, the sample mean difference between two compared groups) would be equal to or more extreme than its observed value.
"Principles
"1. P-values can indicate how incompatible the data are with a specified statistical model.
"2. P-values do not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone.
"3. Scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold.
"4. Proper inference requires full reporting and transparency.
"5. A p-value, or statistical significance, does not measure the size of an effect or the importance of a result.
"6. By itself, a p-value does not provide a good measure of evidence regarding a model or hypothesis."
[1] "The ASA's statement on p-values: context, process, and purpose"
I remember the feeling after my first undergraduate course in statistics being that we stated these principles, then spent the remaining weeks essentially invalidating them without offering any real alternatives. My professor may have been more careful than I remember, but if so, the subtlety was lost on me at the time.
The testing of statistical hypotheses always seemed like an odd area of the mathematical sciences to me, even after later taking a graduate mathematical statistics sequence. It felt like an academic squabble between giants in the field of frequentist inference (Fisher vs. Neyman and Pearson) that ended suddenly without resolution, with the scientific community deciding to sloppily merge the two positions for the purposes of publication and forge onward.
> 2. P-values do not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone.
Sure, but is there not a significant correlation between the two in practice? Or would you trust something that gives a 1% p-value no more than something that gives a 99% p-value?
(Yes, I realize it's easy to construct counterexamples, hence why I asked "in practice".)
Here is what probability theory teaches us. The proper role of data is to adjust our prior beliefs about probabilities to posterior beliefs through Bayes' theorem. The challenge is how to best communicate this result to people who may have had a wide range of prior beliefs.
p-values capture a degree of surprise in the result. Naively, a surprising result should catch our attention and cause us to rethink things. This is not a valid statistical procedure, but it IS how we naively think. And the substitution of a complex question for a simpler one is exactly how our brains are set up to handle complex questions about our environment. (I'm currently working through Thinking Fast and Slow which has a lot to say about this.)
Simple Bayesian approaches take the opposite approach. You generally start with some relatively naive prior, and then treat the posterior as being the conclusion. Which is not very realistic if the real prior was something quite different.
Both approaches have a fundamental mistake. The mistake is that we are taking a data set and asking what it TELLS us about the world. When in probability theory the real role of data is how to UPDATE our views about the world.
This is why I have come to believe that for simple A/B testing, thinking about p-values is a mistake. The only three pieces of information that you need are how much data you are willing to collect, how much you have collected, and how big the performance difference is. Stop either when you have hit the maximum amount of data you're willing to throw at the test, or when the difference exceeds the square root of that maximum amount. This is about as good as any simple rule can do.
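A minimal sketch of that stopping rule in Python (the function name and the conversion-count framing are mine; the comment above doesn't specify units for "performance difference"):

```python
import math

def ab_stopping_rule(successes_a, successes_b, samples_so_far, max_samples):
    """Stop early when the raw difference in successes exceeds
    sqrt(max_samples); otherwise stop when the data budget is spent."""
    threshold = math.sqrt(max_samples)
    diff = successes_a - successes_b
    if abs(diff) > threshold:
        return "A" if diff > 0 else "B"   # difference is big enough: stop now
    if samples_so_far >= max_samples:
        return "A" if diff >= 0 else "B"  # budget exhausted: pick the leader
    return None                           # keep collecting data
```

With a budget of 10,000 samples, for example, the early-stop threshold works out to sqrt(10000) = 100 extra conversions.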
If you try to be clever with p-values, you will generally wind up saving yourself some effort in return for a small risk per test of making very bad mistakes. Accepting a small risk per test, over many tests for a long time, puts you at high odds of eventually making a catastrophic mistake. This is a very bad tradeoff.
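The compounding described here is easy to quantify. Assuming, purely for illustration, a 1% chance per test of a serious mistake (my number, not the commenter's):

```python
# Probability of at least one serious mistake after n independent tests,
# each carrying a "small" 1% risk:
per_test_risk = 0.01
for n in (10, 100, 500):
    print(n, round(1 - (1 - per_test_risk) ** n, 3))
# 10  -> 0.096
# 100 -> 0.634
# 500 -> 0.993
```

By a few hundred tests, the "small" per-test risk has become a near-certainty of at least one bad rollout.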
I've personally seen a bad A/B test with a low p-value rolled out that produced a 15% loss in business for a company whose revenues were in the tens of millions annually. It. Was. Not. Pretty. (The problem was eventually found and fixed... a year later, and after considerable turnover among the executive team.)
It's not just p-values. Some people just don't understand even very basic statistics.
I remember talking to one person in marketing who ran surveys of the company's users. They would send out a survey to all registered users, get back responses from 1% of them or something, and then proceed to report findings based on the responses. They were really happy, since a 1% response rate is great for surveys like this.
I tried to explain to them that all of this statistical machinery relies on having a random sample, and a self-selected sample is not that. No effect whatsoever. Surveys like this are standard practice in the industry. Why are you making trouble, geek-boy?
P-values are so weird. Studies should instead report a likelihood ratio. A likelihood ratio is mathematically correct, and tells you exactly how much to update a hypothesis.
You can convert p-values to likelihood ratios, and they are quite similar. But it's not perfect. A p-value of 0.05 becomes 100:5, or 20:1, which means it increases the odds of a hypothesis by a factor of 20. So a prior probability of 1% updates to about 17%, which is still quite small.
But that assumes that the hypothesis has a 100% chance of producing the same or greater result, which is unlikely. Instead it might only be 50% which is half as much evidence.
In the extreme case, it could be only 5% likely to produce the result, which means the likelihood ratio is 5:5 and is literally no evidence, but still has a p value of 0.05.
Anyway likelihood ratios accumulate exponentially, since they multiply together. As long as there is no publication bias, you can take a few weak studies and produce a single very strong likelihood update.
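The arithmetic in this comment can be checked directly (the 1% prior is the commenter's illustrative number, not a recommendation):

```python
def update_odds(prior_prob, likelihood_ratio):
    """Apply a likelihood ratio to a prior probability via the odds
    form of Bayes' rule, returning the posterior probability."""
    prior_odds = prior_prob / (1 - prior_prob)
    posterior_odds = prior_odds * likelihood_ratio
    return posterior_odds / (1 + posterior_odds)

# A p-value of 0.05 read as a 20:1 likelihood ratio lifts a 1% prior to ~17%:
print(round(update_odds(0.01, 20), 3))      # 0.168

# Likelihood ratios multiply, so three such "weak" studies compound to 20^3:
print(round(update_odds(0.01, 20 ** 3), 3))  # 0.988
```

This is the accumulation the comment describes: each independent study multiplies the odds, so weak evidence stacks multiplicatively rather than additively.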
For context, I have a degree in statistics, and I did research with Andrew Gelman (one of the statisticians quoted in the article).
Glad to see this is gaining traction! I've been saying this for years: the world would actually be in a better place if we just abandoned p-values altogether.
Hypothesis testing is taught in introductory statistics courses because the calculations involved are deceptively easy, whereas more sophisticated statistical techniques would be difficult without any background in linear algebra or calculus.
Unfortunately, this enables researchers in all sorts of fields to make incredibly spurious statistical analyses that look convincing, because "all the calculations are right", even though they're using completely the wrong tool.
Andrew Gelman, quoted in the article, feels very strongly that F-tests are always unnecessary[0]. I'd go as far as to extend that logic to the Student's t-test and any other related test as well.
You can get into all sorts of confusing "paradoxes" with p-values. One of my favorites:
Alice wants to figure out the average height of population. Her null hypothesis is 65 inches. She conducts a simple random sample, performs a t-test, and determines that the sample mean is 70 inches, with a p-value of .01.
In an alternate universe, Bob does the same thing, with the same null hypothesis (65 inches). He determines that the sample mean is 90 inches, with a p-value of .000001.
Some questions:
A) Does Bob's experiment provide stronger evidence for rejecting the null hypothesis than Alice's does?
B) In Bob's universe, is the true population mean higher than it is in Alice's universe?
By pure hypothesis testing alone, the correct answer to both questions is "no", even though the intuitive answer to both questions is "yes"[1].
[1] Part of the problem is that we do expect that, in Bob's universe, the true population mean is highly likely to be higher, and this is supported by the data. Trouble is, the reason we expect that is not formally related to hypothesis testing and t-tests/p-values.
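One way to see why the answers are "no": under a z-test approximation (the sigma and n below are hypothetical, since the comment gives neither), the same mean difference can yield wildly different p-values depending on sample size, so the p-value alone pins down neither the effect size nor the strength of evidence about it:

```python
from math import erf, sqrt

def two_sided_p(sample_mean, null_mean, sigma, n):
    """Two-sided p-value for a mean under a normal approximation."""
    z = abs(sample_mean - null_mean) / (sigma / sqrt(n))
    return 2 * (1 - 0.5 * (1 + erf(z / sqrt(2))))

# Same 5-inch excess over the null of 65, different sample sizes:
print(two_sided_p(70, 65, 10, 4))    # ~0.32: not "significant"
print(two_sided_p(70, 65, 10, 100))  # ~6e-7: "highly significant"
```

Alice's 70-inch mean with a huge sample can easily out-"significance" Bob's 90-inch mean with a tiny one, which is exactly why comparing the two p-values across experiments tells you nothing about the two population means.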
If you're doing a measurement, why have a null hypothesis? Alice should sample the population at random, take the height measurements, calculate the average, plot the distribution, calculate the variance. If the distribution is not sufficiently smooth then continue to take measurements until it's smooth, or unchanging. Then Alice is done discovering all there is to know about the height distribution of the population she sampled. Same with Bob.
This paradox is a great example of why I'm uncomfortable with the idea of coming up with a p-value when you're comparing a sample mean to a point value in the first place.
If your null hypothesis amounts to little more than a scalar value that exists in a vacuum, you haven't really got a null hypothesis. If your null hypothesis is that your results will match what was seen in some other data set collected by some other person who may or may not have been using the same protocol as you, your null hypothesis describes a parallel universe and there's no way to draw an apples-to-apples comparison to it and your alternative hypothesis, which concerns data that did not come from that parallel universe.
So I guess my answer to all questions would be, "Mu."
I can't understand what the problem with p-values is. You should never talk about the results; the essential point is that the method provides valid conclusions 95% of the time and invalid conclusions 5% of the time. If your conclusion is wrong (you are in the 5% part), that is not a paradox. Also, you should not try to prove things, since your conclusion can be wrong; you should be glad to have a procedure that gives you useful information but is not infallible. By being humble you solve the problem.
For giggles and grins, my aunt and uncle are an actual instance of one of the classic frequentist vs. Bayesian examples where frequentist statistics says that something utterly irrelevant should matter.
Scenario 1 (true). Bill and Lorena had 7 children. 6 were boys, 1 was a girl. Are they biased towards having one gender? A 2-sided p-value says that there are 16 possibilities of this strength or more, each of which has probability 1/2^7, for a p-value of 2^4/2^7 = 1/2^3 = 0.125. We therefore fail to reject the null hypothesis at a p-value of 0.05.
Scenario 2 (also true): Bill and Lorena decided to have children until they had a boy and a girl. They had 6 boys then a girl. Are they biased towards having one gender? An event this unlikely could only happen if they had 6 of the same gender in a row, which is 2 possibilities of probability 1/2^6, for a probability of 2/2^6 = 1/2^5 = 0.03125. We therefore reject the null hypothesis at a p-value of 0.05.
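Both p-values can be reproduced in a few lines (a sketch of the two tail computations described above):

```python
from math import comb

# Scenario 1: family size fixed at 7; two-sided tail = 6 or 7 of either sex
p1 = 2 * (comb(7, 6) + comb(7, 7)) / 2**7
print(p1)  # 0.125

# Scenario 2: sample until both sexes appear; outcomes at least this
# extreme = the first six children all share a sex (either sex)
p2 = 2 * (1 / 2) ** 6
print(p2)  # 0.03125
```

Identical data, identical births; only the stopping rule differs, and the p-value changes by a factor of four.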
Now the Bayesian gotcha. According to Bayes' theorem, the intent of Bill and Lorena can have absolutely NO impact on ANY calculation of posterior probabilities from prior expectations. There is no logical way in which this fact should matter at all. And yet it did!
If Alice just wants to know the average height of the population, why is she doing a hypothesis test that the height isn't 65 inches?
Since her hypothesis test is designed to help answer the question of whether or not the mean population height is 65 inches, why should we expect it to tell us anything about the mean population height other than whether or not it being 65 is consistent with the data observed?
Where is the "paradox"? I assume that you mean that by pure hypothesis testing both Alice and Bob reject the null hypothesis at the alpha=0.05 level, for example. The fact that the p-value is irrelevant is by design: you perform the test and either you reject the null hypothesis or you don't. But in fact most people won't do "pure" hypothesis testing and will conclude (somewhat incorrectly) that the evidence against the null hypothesis is indeed stronger in Bob's experiment.
Edit: and why would be the answer to the second question "no"? The hypothesis testing procedure doesn't provide any point estimate at all so the question doesn't really mean anything in that setting.
Thanks! This is one of the best explanations of the problem with using p-values.
Furthermore, I can recall many times when my stats professors would make it very clear what the "right" answer was, despite it being counter-intuitive, but without explaining why.
Is the p-value really not the probability of your results being due to chance? Is that not a perfectly valid definition of it?
I suppose 'chance' is a little hand-wavy, but isn't a p-value just the probability of your data given that your hypothesis is false? Isn't that literally and precisely the probability that they occurred by chance?
> Is the p-value really not the probability of your results being due to chance?
No, it's the probability of a particular observation, given that we assume the result is due to chance. This sounds similar, but the difference is that it doesn't say anything about the probability of your result outside the context of the study's hypotheses.
It's the probability of you seeing your results due to chance if there is no effect.
This differs from the probability of the results being due to chance because it does not take into account the probability of your hypothesis being true or false.
If you observe something that would disprove e.g. General Relativity with a P of 0.001 it is much more likely to be due to chance than if you observe something that is consistent with known science with a P of 0.001, as the weight of all the evidence for General Relativity is very strong.
I'm probably jumping into shark filled waters considering the point of the article is that defining p-values intelligibly is incredibly difficult even for experts, but here goes....
> Is the p-value really not the probability of your results being due to chance?
No, it is not. It is the probability that the statistical attributes of the data would be equal to or more extreme than those observed, assuming the null hypothesis is true.
In your definition, there is no assumption of what model the data should conform to... so what does "by chance" mean in that context? Also, "by chance" doesn't mean "not predictable" or useless. If I roll a 100-sided die an infinite number of times, I'd expect the number of observations of '5' to approach .01 of the total distribution. So, the probability of rolling a 5 by random chance is 1 in 100. However, I would not reject my null hypothesis, since my model predicts exactly this random behavior.
Now, if I rolled a die 1000 times and rolled a 5 every time, the mean of that distribution (5) would be very, very far from the expected mean of my model if the null is assumed to be true. And I might be tempted (very) to reject the null hypothesis that I am rolling a fair 100-sided die.
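To put a number on "very, very far": under the fair-die null, 1000 fives in 1000 rolls has probability (1/100)^1000, which underflows any float, so one has to work in log space:

```python
from math import log10

# Under the null of a fair 100-sided die, each roll shows a 5 with p = 0.01.
# The probability of 1000 fives in a row, computed as a base-10 logarithm:
log_p = 1000 * log10(1 / 100)
print(log_p)  # -2000.0, i.e. the probability is 10^-2000
```

At 10^-2000, "tempted (very)" is an understatement; any sane observer rejects the fair-die hypothesis.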
I will now sit back and wait for my definition and analogy to be torn to shreds. :)
You have to be very specific about "probability that they occurred by chance." If p = 0.05, you can't say "there's a 95% chance this result is real and a 5% chance it's just a fluke." You can say "if chance is the only thing operating, we'd see a result like this only 5% of the time."
In conditional probability notation, it's the difference between P(result | it's just chance) and P(it's just chance | result).
Imagine I handed you a 20-sided die. I claim it says 7 on every side, but I might be lying. You roll a 7. What are the chances it actually has 7 on every side?
You can't actually say unless you either (1) roll the die more times, or (2) assume something about the probability that I gave you an all-7's die to begin with.
Doing (2) is useless, because that's exactly the question we are trying to answer.
For example, suppose I perform this experiment all the time and I know that I give an all-7's die only 1% of the time. With this new information, you could actually calculate the probability of an all-7's die given a 7 roll. Of all the possible outcomes, you could add up the ones where I gave you an all-7s die and the ones where I gave you a normal die but you just rolled 7. Then you could divide that by the total number of possible outcomes.
But this would give you a totally different number than if I give you an all-7's die 99% of the time. And the problem is that you don't have any information about what kind of die you have before you roll it. You're trying to figure out which world we live in -- one where your hypothesis is true or one where it's not.
(I am pretty sure that what I wrote above is true. But one thing I'm not as clear on is how multiple rolls of the die actually can establish confidence percentages. How many rolls does it take to actually establish confidence? Would love to hear from any stats experts about that.)
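A sketch of the calculation (the 1% prior is the commenter's own number, and treating the fair d20 as the only alternative is an assumption of this sketch). It also addresses the parenthetical question: each additional 7 multiplies the odds by another factor of 20, so confidence climbs very fast:

```python
def posterior_all_sevens(prior, sevens_rolled):
    """P(all-7s die | k consecutive 7s), assuming the only alternative
    is a fair 20-sided die."""
    like_trick = 1.0                         # an all-7s die always shows 7
    like_fair = (1 / 20) ** sevens_rolled    # a fair d20 shows 7 with p = 1/20
    num = prior * like_trick
    return num / (num + (1 - prior) * like_fair)

print(round(posterior_all_sevens(0.01, 1), 3))  # 0.168 after one 7
print(round(posterior_all_sevens(0.01, 3), 3))  # 0.988 after three 7s
```

So with a 1% prior, a single 7 is weak evidence, but three consecutive 7s already make the trick die nearly certain. There is no fixed number of rolls that "establishes" confidence; it depends entirely on the prior you start from.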
> I suppose 'chance' is a little hand-wavy, but isn't a p-value just the probability of your data given that your hypothesis is false?
It is the probability of achieving a result at least as far from the hypothesized value exclusively due to random variation that is uncorrelated with the explanatory variable(s) at hand [and subject to a number of other assumptions].
That's not what a p-value is. A p-value is the probability of getting by random chance a result at least as extreme as the measurement. This is not the same as the probability that the effect you measured is due to chance. The latter isn't even well defined without additional assumptions.
Next step: explaining the meaning of 'randomness' to social science students. :)
Really, the amazing amount of bullshit social studies I have seen 'proven' by statistics. Amazing new insights like 'if children wear green shirts while the teacher has a blue shirt, the cognitive attention span is 12.3% higher than for children wearing purple shirts. The effect was measured with a significance of bla bla bla.'
Software like SPSS facilitates this even more. People with no notion of random effects or probability theory click the 'prove my research' button and even get it published.
In undergrad, I learned about p-values but never quite understood how they were actually useful. Now, as a bioinformatics graduate student, I've come to understand that my original instinct was right all along.
Statistical significance is difficult to ensure. Certainly, one should be suspicious if 0.05 is ever used as a significance threshold, because it's unlikely that exactly one hypothesis under one regime was tested in any given paper.
I am glad that the article's headline is clear that it's time to stop misusing P-values. Tests for statistical significance should still be used, and to abandon them would be foolish. In a sense, though, they are the beginning, not the end, of assessment.
p-value analysis has its big caveats, like multiple comparisons, but Bayesian methods have their own, such as priors being extremely hard to specify. Both are challenging to use in a difficult analysis and both can be abused.
The #1 problem with p-values is the word "significant". We should use "detectable" instead. Significant implies meaningful to most people, but not in a statistical context. This is quite confusing. Detectable is better because the mainstream meaning aligns with the jargon.
So:
> "Discovering statistically significant biclusters in gene expression data"
becomes:
> "Discovering statistically detectable biclusters in gene expression data"
This rephrasing makes it evident that "statistically detectable" adds little to the title. So the title becomes
> "Discovering biclusters in gene expression data"
The article tends to imply p-values should not be used at all, rather than just not misused. p-values definitely mean something. For example, if the p-value is 1e-10 (which is often possible), you can be all but certain that the null hypothesis has been ruled out.
So let me rephrase the title of the article: "It's time to use p-values correctly."
Wow, I remember having these reservations about p-values when I took classes in stats but whenever I brought them up a prof. would wave their hands and be dismissive. They gave me a degree in political science, but I felt that political science was an oxymoron and it left me with no respect for the field.
> They gave me a degree in political science, but I felt that political science was an oxymoron and it left me with no respect for the field.
For what it's worth, Andrew Gelman (quoted in the article) is one of the most pre-eminent Bayesian statisticians alive, and is a professor in both the department of Statistics and Political Science!
"Political Science" need not be an oxymoron, even if a lot of self-professed political scientists use rather unscientific methods.
In theory, you are correct, but some data sets are too small, or make it too difficult to cross-validate in a way that is meaningful to the original problem.
This is one of the best conversation threads I've ever seen on HN. It's both polite and informative.
I want to toss in my own thoughts here.
Since I've spent the vast majority of my tech career in the Market Research industry (hello, bias!), I'm tempted to say that one of the most frequent intersections between statistical science and business decisions happens in that world.
Product testing, shopper marketing, A/B testing . . . these are pretty common fare these days. But I feel like the MR people are sort of their own worst enemy in many cases.
It's a fairly recent development that MR people are even allowed a seat at the table for major product or business decisions. And when the data nerds show up at the meeting, we have to make human communication decisions that are difficult.
I can't show up at the C-suite and lecture company executives about the finer points of statistical philosophy. When I'm presenting findings to stake-holders, it's my job to abstract the details and present something that makes a coherent case for a decision, based on the data we have available.
It is sinfully attractive to go tell your boss's boss's boss that we have a threshold--a number we can point to. If this number turns out to be smaller than .05, this project is a go.
Three months later, you go back to that boss and tell him the number came back and it was .0499999. The boss says, "Okay, go!" And then you are all, "Wait, wait, wait. Hang on a second. Let's talk about this."
My god, what have I done?
The practical reality of the intersection of statistics and business is a harsh one. We have to do better. In terms of leaky abstractions, the communication of data science to business decision makers is quite possibly the leaky-est of all.
Why is it so leaky? I have two points about this.
1) Statistics is one of the most existentially depressing fields of study. There is no acceptance; there is no love; there is nothing positive about it. Ever.
Statistics is always about rejection and failure. We never accept or affirm a hypothesis. We only ever reject the null hypothesis or we fail to reject it. That's it.
2) In business, we tend to be very very sloppy about formulating our hypotheses. Sometimes we don't even really think about them at all.
Take a common case for market research. New product testing. We do a rep sample with a decent size (say, 1800 potential product buyers) and we randomly show five different products, one of which is the product the person already owns/uses (because that's called control /s). The other 4 products are variations on a theme with different attributes.
What's the null hypothesis here? Does it ever get discussed?
What's the alternative hypothesis?
The implicit and never-talked-about null is that all things being equal, there is no difference between the distribution of purchase likelihood among all products. The alternative is that there is a real difference on a scale of likely to purchase.
The implicit and intuitive assumption is that there is something about that feature set that drives the difference. (I'm looking at you, Max Diff)
But that's not real. It's not a part of the test. The only test you can do in that situation is to check if those aggregate distributions are different from each other. The real null is that they are the same, and the alternative is that they are different.
All you can do with statistics is tell whether two distributions are plausibly the same or different.
Now, who wants to try to explain any of that to your CEO? No one does. Your CEO doesn't want it, you don't want it, your girlfriend doesn't want it. No one wants it.
So we try to abstract, and I feel like we mostly fail at doing a good job of that.
This is getting really long, and I don't want to rant. So to finish up, an idea for more effective uses of data science as it interacts with the business world:
I agree, let's stop talking about p values. Let's work harder and funnel the results of those MR studies into practical models of the business' future. Let's take the research and pipe it into Bayesian expected value models.
Let's stop showing stacked bar charts to execs and expecting them to make good decisions based on weak evidence we got from hypotheses we didn't really think about in the first place.
Some of this might come across as a rant. I hope it is not taken that way. This is a real problem that I've been thinking about for a long time. And I don't mean to step on anyone's toes. I have certainly committed many of the data sins that I'm deriding above.
Edited to add:
The real workings of statistics are unintuitive. I'm not saying that they are wrong. But in working with people for years now, I understand the confusion. It's a psychological problem. Hypotheses are either not really well thought out or not considered in an organized way, in my experience.
A hypothesis is not concrete in many practical cases. It's a thought. An idea, perhaps. It's often a thing that floats around in your mind, or maybe you paid some lip service and tossed it into your note-taking app.
Data seem much more real. You download a few gigabytes of data and start working on it. It's quite easy to get confused.
I have real data! This is tangible stuff. Thinking of things properly and evaluating the probability of your data given the hypothesis is hard. Your data seems much more concrete. These are real people answering real questions about X.
Even for people who are really hell-bent on statistical rigor, this is a challenge.
If they held the same meeting 20 times, would they reach the same conclusion in 19 of those meetings?
On a more serious note, I think that the use of the word "significant" to mean "the effect is reasonably likely to exist by some standard" should be abolished.
Webster's 1913 dictionary says:
> Deserving to be considered; important; momentous; as, a significant event.
Statisticians don't use "significant" to mean important at all -- they use it to mean "I could detect it". This is bad when someone publishes a paper saying "I found that some drug significantly reduces such-and-such" -- this could just mean that they did a HUGE study and found that the drug reliably had some completely unimportant effect. It's much worse when it's negated, though. Think about all the headlines that say that some treatment "did not have a significant effect". This is basically meaningless. I could do a study finding that exercise has no significant effect on fitness, for example, by making the study small enough.
A good friend of mine suggested that statisticians replace "significant" with "discernible". So next time someone does a small study, they might find that "eating fat had no discernible effect on weight gain", and perhaps readers would then ask the obvious question, which is "how hard did you look?".
This would also help people doing very good research make less wishy-washy conclusions. For example, suppose that "vaccines have no discernible effect on autism rates". This is probably true in a number of studies, but it's the wrong analysis. If researchers who did these studies had to state the conclusions in a silly manner like that, maybe they'd find a more useful analysis to do.
Hint: doing big studies just so you can fail to find an effect is nonsensical. Instead, do big studies so you can put a tight upper bound on the effect. Don't tell me that vaccines don't have a significant (or discernible) effect on autism. Tell me that, with 99.9% confidence, you have ruled out the possibility that vaccines have caused more than ten autism cases in the entire history of vaccines, and that, most likely, they've caused no cases whatsoever (or whatever the right numbers are).
There is a significant (pun) difference between statistically significant and economically significant. The conclusion in the drug paper conflates the two. In finance, we could find plenty of statistically significant results (e.g. small cap stocks outperform large cap stocks on Fridays, with a small p-value if you like), but most results were not economically significant--they were not usable for a trading system because they were too small to overcome real-world costs. In short, they weren't meaningful in a real world sense, even though the result was detectable statistically.
That's the second definition in Webster's 1913 edition. The first is:
> Fitted or designed to signify or make known something; having a meaning; standing as a sign or token; expressive or suggestive; as, a significant word or sound; a significant look.
It seems to me that this is the sense in which statisticians talk about significance. It means that the results actually signify something rather than just being meaningless noise.
There is a difference between effect size and significance. Things can be extremely statistically significant and have very small effect sizes.
But the problem is less here and more that people don't care to understand the models they are using to judge statistical significance. A p-value is simply a magic wand to wave over the data and bless it. Statisticians may tend to look at the data a lot more qualitatively - a p-value might tell you something, but much more important is: "How accurately have I managed to model this system?"
This is the larger problem lurking behind "p-hacking" and other colossal statistical fuck-ups: people don't understand the mathematical models they are applying, the limitations of the data, and often don't care to, as long as the veneer of having 'done something' can be applied.
This, again, is probably a product of people cranking out shitty papers to make sure that they keep publishing, to continue eking out grants; which, again, is probably a product of research science being generally underfunded for the demands placed on it.
[+] [-] tokenadult|10 years ago|reply
"What is a p-value?
"Informally, a p-value is the probability under a specified statistical model that a statistical summary of the data (for example, the sample mean difference between two compared groups) would be equal to or more extreme than its observed value.
"Principles
"1. P-values can indicate how incompatible the data are with a specified statistical model.
"2. P-values do not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone.
"3. Scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold.
"4. Proper inference requires full reporting and transparency.
"5. A p-value, or statistical significance, does not measure the size of an effect or the importance of a result.
"6. By itself, a p-value does not provide a good measure of evidence regarding a model or hypothesis."
[1] "The ASA's statement on p-values: context, process, and purpose"
http://amstat.tandfonline.com/doi/abs/10.1080/00031305.2016....
[+] [-] thearn4|10 years ago|reply
The testing of statistical hypotheses always seemed like an odd area of the mathematical sciences to me, even after later taking a graduate mathematical statistics sequence. Like an academic squabble between giants in the field of frequentist inference (Fisher vs. Neyman and Pearson) that ended suddenly without resolution, with the scientific community deciding to sloppily merge the two positions for the purposes of publication and forge onward.
[+] [-] wfunction|10 years ago|reply
Sure, but is there not a significant correlation between the two in practice? Or would you trust something that gives a 1% p-value equally as one that gives a 99% p-value?
(Yes, I realize it's easy to construct counterexamples, hence why I asked "in practice".)
[+] [-] wdewind|10 years ago|reply
There is no lower threshold at which the data becomes non-predictive?
[+] [-] btilly|10 years ago|reply
Here is what probability theory teaches us. The proper role of data is to adjust our prior beliefs about probabilities to posterior beliefs through Bayes' theorem. The challenge is how to best communicate this result to people who may have had a wide range of prior beliefs.
p-values capture a degree of surprise in the result. Naively, a surprising result should catch our attention and cause us to rethink things. This is not a valid statistical procedure, but it IS how we naively think. And the substitution of a complex question for a simpler one is exactly how our brains are set up to handle complex questions about our environment. (I'm currently working through Thinking Fast and Slow which has a lot to say about this.)
Simple Bayesian approaches take the opposite approach. You generally start with some relatively naive prior, and then treat the posterior as being the conclusion. Which is not very realistic if the real prior was something quite different.
Both approaches have a fundamental mistake. The mistake is that we are taking a data set and asking what it TELLS us about the world. When in probability theory, the real role of data is to UPDATE our views about the world.
This is why I have come to believe that for simple A/B testing, thinking about p-values is a mistake. The only three pieces of information you need are how much data you are willing to collect, how much you have collected, and how big the performance difference is. Stop either when you have hit the maximum amount of data you're willing to throw at the test, or when the difference exceeds the square root of that maximum amount. This is about as good as any simple rule can do.
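A minimal sketch of the stopping rule described above (the function name and the conversion-rate setup are illustrative assumptions, not from the comment):

```python
import math
import random

def ab_test(max_samples, rate_a, rate_b, seed=0):
    """Sketch of the rule above: stop when the gap in successes exceeds
    sqrt(max_samples), or when the sample budget is exhausted."""
    rng = random.Random(seed)
    threshold = math.sqrt(max_samples)
    wins_a = wins_b = 0
    for n in range(1, max_samples + 1):
        wins_a += rng.random() < rate_a   # one visitor per arm per step
        wins_b += rng.random() < rate_b
        if abs(wins_a - wins_b) > threshold:
            return ("A" if wins_a > wins_b else "B"), n
    return None, max_samples  # no convincing difference within budget
```

With a budget of 10,000 samples, the success gap has to exceed 100 before the rule declares a winner early; otherwise it runs the budget out and declares no difference.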
If you try to be clever with p-values you will generally wind up saving yourself some effort in return for a small risk per test of making very bad mistakes. Accepting a small risk per test over many tests for a long time puts you at high odds of eventually making a catastrophic mistake. This is a very bad tradeoff.
I've personally seen a bad A/B test with a low p-value rolled out that produced a 15% loss in business for a company whose revenues were in the tens of millions annually. It. Was. Not. Pretty. (The problem was eventually found and fixed... a year later and after considerable turnover among the executive team.)
[+] [-] johan_larson|10 years ago|reply
I remember talking to one person in marketing who ran surveys of the company's users. They would send out a survey to all registered users, get back responses from 1% of them or something, and then proceed to report findings based on the responses. They were really happy, since a 1% response rate is great for surveys like this.
I tried to explain to them that all of this statistical machinery relies on having a random sample, and a self-selected sample is not that. No effect whatsoever. Surveys like this are standard practice in the industry. Why are you making trouble, geek-boy?
[+] [-] Houshalter|10 years ago|reply
You can convert p-values to likelihood ratios, and they are quite similar. But it's not perfect. A p value of 0.05 becomes 100:5, or 20:1. Which means it increases the odds of a hypothesis by 20. So a probability of 1% updates to 17%, which is still quite small.
But that assumes that the hypothesis has a 100% chance of producing the same or greater result, which is unlikely. Instead it might only be 50% which is half as much evidence.
In the extreme case, it could be only 5% likely to produce the result, which means the likelihood ratio is 5:5, i.e. 1:1, and is literally no evidence, but still has a p value of 0.05.
Anyway likelihood ratios accumulate exponentially, since they multiply together. As long as there is no publication bias, you can take a few weak studies and produce a single very strong likelihood update.
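The arithmetic in this comment checks out and can be verified in a few lines (a sketch; the 20:1 figure assumes, as the comment notes, that the hypothesis predicts the result with certainty):

```python
def update(prior, likelihood_ratio):
    """Convert a prior probability to a posterior via odds and Bayes' rule."""
    posterior_odds = (prior / (1 - prior)) * likelihood_ratio
    return posterior_odds / (1 + posterior_odds)

update(0.01, 20)   # a 1% hypothesis rises to roughly 17%
update(0.01, 10)   # if the hypothesis only half-predicts the result: roughly 9%
```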
[+] [-] chimeracoder|10 years ago|reply
Glad to see this is gaining traction! I've been saying this for years: the world would actually be in a better place if we just abandoned p-values altogether.
Hypothesis testing is taught in introductory statistics courses because the calculations involved are deceptively easy, whereas more sophisticated statistical techniques would be difficult without any background in linear algebra or calculus.
Unfortunately, this enables researchers in all sorts of fields to make incredibly spurious statistical analyses that look convincing, because "all the calculations are right", even though they're using completely the wrong tool.
Andrew Gelman, quoted in the article, feels very strongly that F-tests are always unnecessary[0]. I'd go as far as to extend that logic to the Student's t-test and any other related test as well.
You can get into all sorts of confusing "paradoxes" with p-values. One of my favorites:
Alice wants to figure out the average height of population. Her null hypothesis is 65 inches. She conducts a simple random sample, performs a t-test, and determines that the sample mean is 70 inches, with a p-value of .01.
In an alternate universe, Bob does the same thing, with the same null hypothesis (65 inches). He determines that the sample mean is 90 inches, with a p-value of .000001.
Some questions:
A) Does Bob's experiment provide stronger evidence for rejecting the null hypothesis than Alice's does?
B) In Bob's universe, is the true population mean higher than it is in Alice's universe?
By pure hypothesis testing alone, the correct answer to both questions is "no", even though the intuitive answer to both questions is "yes"[1].
[0] http://andrewgelman.com/2009/05/18/noooooooooooooo/
[1] Part of the problem is that we do expect that, in Bob's universe, the true population mean is highly likely to be higher, and this is supported by the data. Trouble is, the reason we expect that is not formally related to hypothesis testing and t-tests/p-values.
[+] [-] eanzenberg|10 years ago|reply
[+] [-] bunderbunder|10 years ago|reply
If your null hypothesis amounts to little more than a scalar value that exists in a vacuum, you haven't really got a null hypothesis. If your null hypothesis is that your results will match what was seen in some other data set collected by some other person who may or may not have been using the same protocol as you, your null hypothesis describes a parallel universe, and there's no way to draw an apples-to-apples comparison between it and your alternative hypothesis, which concerns data that did not come from that parallel universe.
So I guess my answer to all questions would be, "Mu."
[+] [-] statsaresimple|10 years ago|reply
[+] [-] btilly|10 years ago|reply
Scenario 1 (true). Bill and Lorena had 7 children. 6 were boys, 1 was a girl. Are they biased towards having one gender? A 2-sided p-value says that there are 16 possibilities of this strength or more, each of which has probability 1/2^7, for a p-value of 2^4/2^7 = 1/2^3 = 0.125. We therefore fail to reject the null hypothesis at a p-value of 0.05.
Scenario 2 (also true): Bill and Lorena decided to have children until they had a boy and a girl. They had 6 boys then a girl. Are they biased towards having one gender? An event this unlikely could only happen if they had 6 of the same gender in a row, which is 2 possibilities each of probability 1/2^6, for a probability of 2/2^6 = 1/2^5 = 0.03125. We therefore reject the null hypothesis at a p-value of 0.05.
Now the Bayesian gotcha. According to Bayes' theorem, the intent of Bill and Lorena can have absolutely NO impact on ANY calculation of posterior probabilities from prior expectations. There is no logical way in which this fact should matter at all. And yet it did!
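Both numbers can be checked by direct enumeration under the stated fair-coin model:

```python
from math import comb

# Scenario 1: n is fixed at 7 births; two-sided p-value for a 6-1 split
# or anything more extreme (i.e. 0, 1, 6, or 7 boys).
n = 7
p1 = sum(comb(n, k) for k in (0, 1, 6, 7)) / 2 ** n   # 16/128 = 0.125

# Scenario 2: sampling stops at the first child of the second gender.
# "As extreme or more" means the first 6 births share a gender.
p2 = 2 * (1 / 2) ** 6                                  # 2/64 = 0.03125
```

Same data, different stopping rule, different p-value, which is exactly the gotcha.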
[+] [-] thinkmoore|10 years ago|reply
Since her hypothesis test is designed to help answer the question of whether or not the mean population height is 65 inches, why should we expect it to tell us anything about the mean population height other than whether or not it being 65 is consistent with the data observed?
[+] [-] kgwgk|10 years ago|reply
Edit: and why would the answer to the second question be "no"? The hypothesis testing procedure doesn't provide any point estimate at all, so the question doesn't really mean anything in that setting.
[+] [-] tryitnow|10 years ago|reply
Furthermore, I can recall many times when my stats professors would make it very clear what the "right" answer was, despite it being counter-intuitive but without explaining
[+] [-] statsaresimple|10 years ago|reply
[deleted]
[+] [-] darawk|10 years ago|reply
I suppose 'chance' is a little hand-wavy, but isn't a p-value just the probability of your data given that your hypothesis is false? Isn't that literally and precisely the probability that they occurred by chance?
[+] [-] jpeterson|10 years ago|reply
No, it's the probability of a particular observation, given that we assume the result is due to chance. This sounds similar, but the difference is that it doesn't say anything about the probability of your result outside the context of the study's hypotheses.
[+] [-] aidenn0|10 years ago|reply
It's the probability of you seeing your results due to chance if there is no effect.
This differs from the probability of the results being due to chance because it does not take into account the probability of your hypothesis being true or false.
If you observe something that would disprove e.g. General Relativity with a P of 0.001 it is much more likely to be due to chance than if you observe something that is consistent with known science with a P of 0.001, as the weight of all the evidence for General Relativity is very strong.
[+] [-] jkyle|10 years ago|reply
> Is the p-value really not the probability of your results being due to chance?
No, it is not. It is the probability that the statistical attributes of the data would be equal to or more extreme than observed, assuming the Null Hypothesis is true.
In your definition, there is no assumption of what model the data should conform to... so what does "by random" mean in that context? Also, "by random" doesn't mean 'not predictable' or useless. If I roll a 100-sided die an infinite number of times, I'd expect the number of occurrences of '5' to approach .01 of the total distribution. So, the probability of rolling a 5 by random chance is 1 in 100. However, I would not reject my Null Hypothesis since my model predicts exactly this random behavior.
Now, if I rolled a die 1000 times and rolled a 5 every time, the mean of that distribution (5) would be very, very far from the expected mean of my model if the Null is assumed to be true. And I may be tempted (very) to reject the Null that I am rolling a 100-sided fair die.
I will now sit back and wait for my definition and analogy to be torn to shreds. :)
[+] [-] capnrefsmmat|10 years ago|reply
In conditional probability notation, it's the difference between P(result | it's just chance) and P(it's just chance | result).
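A toy calculation makes the gap between the two conditionals concrete (the 10% base rate, 80% power, and 5% alpha are illustrative assumptions, not numbers from the thread):

```python
n_true, n_false = 100, 900   # 1000 hypotheses tested, 10% actually true
power, alpha = 0.80, 0.05

true_hits = n_true * power    # 80 true positives
false_hits = n_false * alpha  # 45 false positives

p_result_given_chance = alpha                                  # 0.05
p_chance_given_result = false_hits / (true_hits + false_hits)  # 0.36
```

Even with a 5% false-positive rate per test, more than a third of the "significant" results in this scenario are flukes.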
[+] [-] haberman|10 years ago|reply
You can't actually say unless you either (1) roll the die more times, or (2) assume something about the probability that I gave you an all-7's die to begin with.
Doing (2) is useless, because that's exactly the question we are trying to answer.
For example, suppose I perform this experiment all the time and I know that I give an all-7's die only 1% of the time. With this new information, you could actually calculate the probability of an all-7's die given a 7 roll. Of all the possible outcomes, you could add up the ones where I gave you an all-7s die and the ones where I gave you a normal die but you just rolled 7. Then you could divide that by the total number of possible outcomes.
But this would give you a totally different number than if I give you an all-7's die 99% of the time. And the problem is that you don't have any information about what kind of die you have before you roll it. You're trying to figure out which world we live in -- one where your hypothesis is true or one where it's not.
(I am pretty sure that what I wrote above is true. But one thing I'm not as clear on is how multiple rolls of the die actually can establish confidence percentages. How many rolls does it take to actually establish confidence? Would love to hear from any stats experts about that.)
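To the closing question: the posterior after k rolls follows directly from Bayes' theorem. A sketch, assuming the "normal" die is an eight-sided one that shows 7 with probability 1/8 (any definite assumption works; the point is how fast the evidence accumulates and how much the prior matters):

```python
def p_trick_die(prior, k, sides=8):
    """Posterior probability of the all-7s die after k consecutive 7s."""
    like_trick = 1.0              # the trick die always shows 7
    like_fair = (1 / sides) ** k  # a fair die shows k sevens with prob (1/8)^k
    return prior * like_trick / (prior * like_trick + (1 - prior) * like_fair)

p_trick_die(0.01, 1)   # one 7 takes a 1% prior only to about 7%
p_trick_die(0.01, 5)   # five 7s in a row: near certainty even at a 1% prior
p_trick_die(0.99, 0)   # with no rolls, the answer is just the prior
```

So the number of rolls needed depends entirely on the prior, which is exactly the parent's point: without (2), the question has no answer.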
[+] [-] chimeracoder|10 years ago|reply
It is the probability of achieving a result at least as far from the hypothesized value exclusively due to random variation that is uncorrelated with the explanatory variable(s) at hand [and subject to a number of other assumptions].
[+] [-] jules|10 years ago|reply
[+] [-] unknown|10 years ago|reply
[deleted]
[+] [-] irremediable|10 years ago|reply
* Ok, a chosen null hypothesis, often chosen to be something like "chance".
[+] [-] unknown|10 years ago|reply
[deleted]
[+] [-] FrankyHollywood|10 years ago|reply
Really, the amazing amount of bullshit social studies I have seen 'proven' by statistics. Amazing new insights like 'if children wear green shirts, while the teacher has a blue shirt the cognitive attention span is 12.3% higher than children wearing purple shirts. The effect was measured with a significance of bla bla bla.'
Software like SPSS facilitates this even more. People with no notion of random effects or probability theory click on the 'prove my research' button and even get it published.
So there's a lot more work to do in this area!
[+] [-] rcthompson|10 years ago|reply
[+] [-] carbocation|10 years ago|reply
I am glad that the article's headline is clear that it's time to stop misusing P-values. Tests for statistical significance should still be used, and to abandon them would be foolish. In a sense, though, they are the beginning, not the end, of assessment.
[+] [-] eanzenberg|10 years ago|reply
[+] [-] benjaminmhaley|10 years ago|reply
So:
> "Discovering statistically significant biclusters in gene expression data"
becomes:
> "Discovering statistically detectable biclusters in gene expression data"
This rephrasing makes it evident that "statistically detectable" adds little to the title. So the title becomes
> "Discovering biclusters in gene expression data"
A better title.
[+] [-] giardini|10 years ago|reply
"Mindless Statistics"
http://library.mpib-berlin.mpg.de/ft/gg/GG_Mindless_2004.pdf
or
http://www.unh.edu/halelab/BIOL933/papers/2004_Gigerenzer_JS...
[+] [-] altrego99|10 years ago|reply
[+] [-] daodedickinson|10 years ago|reply
[+] [-] chimeracoder|10 years ago|reply
For what it's worth, Andrew Gelman (quoted in the article) is one of the most pre-eminent Bayesian statisticians alive, and is a professor in both the department of Statistics and Political Science!
"Political Science" need not be an oxymoron, even if a lot of self-professed political scientists use rather unscientific methods.
[+] [-] purpled_haze|10 years ago|reply
[+] [-] mikeskim|10 years ago|reply
[+] [-] vasilipupkin|10 years ago|reply
[+] [-] ianamartin|10 years ago|reply
I want to toss in my own thoughts here.
Since I've spent the vast majority of my tech career in the Market Research industry (hello, bias!), I'm tempted to say that one of the most frequent intersections between statistical science and business decisions happens in that world.
Product testing, shopper marketing, A/B testing . . . these are pretty common fare these days. But I feel like the MR people are sort of their own worst enemy in many cases.
It's a fairly recent development that MR people are even allowed a seat at the table for major product or business decisions. And when the data nerds show up at the meeting, we have to make human communication decisions that are difficult.
I can't show up at the C-suite and lecture company executives about the finer points of statistical philosophy. When I'm presenting findings to stake-holders, it's my job to abstract the details and present something that makes a coherent case for a decision, based on the data we have available.
It is sinfully attractive to go tell your boss's boss's boss that we have a threshold--a number we can point to. If this number turns out to be smaller than .05, this project is a go.
Three months later, you go back to that boss and tell him the number came back and it was .0499999. The boss says, "Okay, go!" And then you are all, "Wait, wait, wait. Hang on a second. Let's talk about this."
My god, what have I done?
The practical reality of the intersection of statistics and business is a harsh one. We have to do better. In terms of leaky abstractions, the communication of data science to business decision makers is quite possibly the leaky-est of all.
Why is it so leaky? I have two points about this.
1) Statistics is one of the most existentially depressing fields of study. There is no acceptance; there is no love; there is nothing positive about it. Ever.
Statistics is always about rejection and failure. We never accept or affirm a hypothesis. We only ever reject the null hypothesis or we fail to reject it. That's it.
2) In business, we tend to be very very sloppy about formulating our hypotheses. Sometimes we don't even really think about them at all.
Take a common case for market research. New product testing. We do a rep sample with a decent size (say, 1800 potential product buyers) and we randomly show five different products, one of which is the product the person already owns/uses (because that's called control /s). The other 4 products are variations on a theme with different attributes.
What's the null hypothesis here? Does it ever get discussed?
What's the alternative hypothesis?
The implicit and never-talked-about null is that all things being equal, there is no difference between the distribution of purchase likelihood among all products. The alternative is that there is a real difference on a scale of likely to purchase.
The implicit and intuitive assumption is that there is something about that feature set that drives the difference. (I'm looking at you, Max Diff)
But that's not real. It's not a part of the test. The only test you can do in that situation is to check if those aggregate distributions are different from each other. The real null is that they are the same, and the alternative is that they are different.
All you can do with statistics is tell if two distributions are isomorphic.
Now, who wants to try to explain any of that to your CEO? No one does. Your CEO doesn't want it, you don't want it, your girlfriend doesn't want it. No one wants it.
So we try to abstract, and I feel like we mostly fail at doing a good job of that.
This is getting really long, and I don't want to rant. So to finish up, an idea for more effective uses of data science as it interacts with the business world:
I agree, let's stop talking about p values. Let's work harder and funnel the results of those MR studies into practical models of the business' future. Let's take the research and pipe it into Bayesian expected value models.
Let's stop showing stacked bar charts to execs and expecting them to make good decisions based on weak evidence we got from hypotheses we didn't really think about in the first place.
Some of this might come across as a rant. I hope it is not taken that way. This is a real problem that I've been thinking about for a long time. And I don't mean to step on anyone's toes. I have certainly committed many of the data sins that I'm deriding above.
Edited to add:
The real workings of statistics are unintuitive. I'm not saying that they are wrong. But in working with people for years now, I understand the confusion. It's a psychological problem. Hypotheses are either not really well thought out or not considered in an organized way, in my experience.
A hypothesis is not concrete in many practical cases. It's a thought. An idea, perhaps. It's often a thing that floats around in your mind, or maybe you paid some lip service and tossed it into your note-taking app.
Data seem much more real. You download a few gigabytes of data and start working on it. It's quite easy to get confused.
I have real data! This is tangible stuff. Thinking of things properly and evaluating the probability of your data given the hypothesis is hard. Your data seems much more concrete. These are real people answering real questions about X.
Even for people who are really hell-bent on statistical rigor, this is a challenge.
[+] [-] amluto|10 years ago|reply
On a more serious note, I think that the use of the word "significant" to mean "the effect is reasonably likely to exist by some standard" should be abolished.
Webster's 1913 dictionary says:
> Deserving to be considered; important; momentous; as, a significant event.
Statisticians don't use "significant" to mean important at all -- they use it to mean "I could detect it". This is bad when someone publishes a paper saying "I found that some drug significantly reduces such-and-such" -- this could just mean that they did a HUGE study and found that the drug reliably had some completely unimportant effect. It's much worse when it's negated, though. Think about all the headlines that say that some treatment "did not have a significant effect". This is basically meaningless. I could do a study finding that exercise has no significant effect on fitness, for example, by making the study small enough.
A good friend of mine suggested that statisticians replace "significant" with "discernible". So next time someone does a small study, they might find that "eating fat had no discernible effect on weight gain", and perhaps readers would then ask the obvious question, which is "how hard did you look?".
This would also help people doing very good research make less wishy-washy conclusions. For example, suppose that "vaccines have no discernible effect on autism rates". This is probably true in a number of studies, but it's the wrong analysis. If researchers who did these studies had to state the conclusions in a silly manner like that, maybe they'd find a more useful analysis to do.
Hint: doing big studies just so you can fail to find an effect is nonsensical. Instead, do big studies so you can put a tight upper bound on the effect. Don't tell me that vaccines don't have a significant (or discernible) effect on autism. Tell me that, with 99.9% confidence, you have ruled out the possibility that vaccines have caused more than ten autism cases in the entire history of vaccines, and that, most likely, they've caused no cases whatsoever (or whatever the right numbers are).
Edit: fixed an insignificant typo.
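For the simplest case of the "tight upper bound" suggestion above — zero adverse events observed in n exposures — the exact upper confidence bound has a one-line closed form (a sketch; nonzero counts need the full binomial/Clopper-Pearson machinery):

```python
def upper_bound_rate(n, confidence=0.999):
    """Upper confidence bound on an event rate after observing 0 events in
    n independent trials: the r solving (1 - r)**n == 1 - confidence."""
    return 1 - (1 - confidence) ** (1 / n)

upper_bound_rate(10**6)   # ~6.9e-6: at most ~7 events per million exposures
```

At 95% confidence this reduces to the familiar "rule of three": the bound is approximately 3/n.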
[+] [-] evolsb|10 years ago|reply
[+] [-] chc|10 years ago|reply
> Fitted or designed to signify or make known something; having a meaning; standing as a sign or token; expressive or suggestive; as, a significant word or sound; a significant look.
It seems to me that this is the sense in which statisticians talk about significance. It means that the results actually signify something rather than just being meaningless noise.
[+] [-] astazangasta|10 years ago|reply
But the problem is less here and more that people don't care to understand the models they are using to judge statistical significance. A p-value is simply a magic wand to wave over the data and bless it. Statisticians may tend to look at the data a lot more qualitatively - a p-value might tell you something, but much more important is: "How accurately have I managed to model this system?"
This is the larger problem lurking behind "p-hacking" and other colossal statistical fuck-ups: people don't understand the mathematical models they are applying, the limitations of the data, and often don't care to, as long as the veneer of having 'done something' can be applied.
This, again, is probably a product of people cranking out shitty papers to make sure that they keep publishing, to continue eking out grants; which, again, is probably a product of research science being generally underfunded for the demands placed on it.
[+] [-] abecedarius|10 years ago|reply
While I agree that 'significant' has a misleading connotation, 'discernible' is also misleading. 'Statistically significant' just isn't any everyday concept, and trying to phrase it as one will encourage people to make mistakes. It's a complex concept: if we tried this experiment under a certain null hypothesis, then it'd be at least this improbable to see a result at least this extreme. The most I'd be willing to cut it down, after so much confusion in its actual use, is "subjunctively improbable", with the null hypothesis and the threshold left implicit. "Eating fat had no subjunctively improbable effect on weight gain." This sounds technical and fiddly, which I think is a feature: if you don't like it, don't base your reporting on a technical, fiddly concept.
"Eating fat had no discernible effect on weight gain" sounds like getting evidence against such an effect, but it's compatible with getting evidence in favor, that's just not as strong as some threshold. That evidence could be useful in a meta-analysis, or for a decision when waiting for more information isn't practical or economical, or if the potential gain from trying the nonsignificant treatment is high and the potential loss low. (I've seen "no significant X" abused this way. Nobody should try X -- it's unscientific!)
[+] [-] ryanmonroe|10 years ago|reply
>Hint: doing big studies just so you can fail to find an effect is nonsensical. Instead, do big studies so you can put a tight upper bound on the effect. Don't tell me that vaccines don't have a significant (or discernible) effect on autism. Tell me that, with 99.9% confidence, you have ruled out the possibility that vaccines have caused more than ten autism cases in the entire history of vaccines, and that, most likely, they've caused no cases whatsoever (or whatever the right numbers are).
Checking the bounds of a 100(1-a)% confidence interval is exactly equivalent to checking for a p-value below a%.
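The equivalence is easy to demonstrate for a normal (z) test — a sketch with the usual two-sided conventions, where `estimate / stderr` is the z statistic:

```python
import math

Z95 = 1.959963984540054  # Phi^{-1}(0.975), the two-sided 5% critical value

def two_sided_p(z):
    """Two-sided p-value of a z statistic: 2 * (1 - Phi(|z|))."""
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

def ci95_excludes_zero(estimate, stderr):
    """True iff the 95% interval estimate +/- Z95*stderr does not cover 0."""
    return abs(estimate) > Z95 * stderr

# The two decisions coincide for any estimate and standard error:
for est in (0.1, 0.5, 1.0, 2.5, 4.0):
    assert (two_sided_p(est / 1.0) < 0.05) == ci95_excludes_zero(est, 1.0)
```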
[+] [-] eanzenberg|10 years ago|reply
5% (1 in 20) is a pretty weak threshold to pass. Let's go 5 sigma (p < 3e-7) for discoveries and reserve 0.05 < p < 3e-7 for stuff we should take closer looks at.
[+] [-] carbocation|10 years ago|reply
> 5% (1 in 20) is a pretty weak threshold to pass. Let's go 5 sigma (p < 3e-7) for discoveries and reserve 0.05 < p < 3e-7 for stuff we should take closer looks at.
This would still end up leading to misuse of P-values. Let's say you're doing a genome-wide association study on several hundred thousand SNPs. The traditional threshold is 5e-8 (0.05 / 1,000,000 effective tests). So using 3e-7 for the threshold for "discovery", you'd count many things as discovery that shouldn't be so.
On the other hand, let's say you do a study with 20 people with cancer. You give 10 of them a drug, the other 10 a placebo. All 10 with the drug survive; all 10 with the placebo die. Your P value is 0.0002. This doesn't count as discovery, but clinically I know what my judgment is going to be.
This is all to say that the misuse of P-values does not just come from the threshold.
[+] [-] capnrefsmmat|10 years ago|reply
That's fine, but then your power is low, and nearly every result you get will be an exaggeration. The more stringent your p value threshold, the more dramatic your results must be to be significant; if your sample size isn't adequate, you'll only get significance if you overestimate the effect.
This is an enormously common problem even with current p value thresholds. It's part of the reason why you see dramatic "A causes B!" results followed by replications saying "well, only a little."
http://www.statisticsdonewrong.com/regression.html#truth-inf...
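This exaggeration effect is easy to reproduce in simulation — a sketch with made-up numbers: a true effect of 0.2 measured with standard error 0.5, i.e. a badly underpowered design:

```python
import random

def mean_significant_estimate(true_effect=0.2, stderr=0.5,
                              n_studies=20000, seed=42):
    """Average the estimates from the simulated studies whose z statistic
    cleared the usual |z| > 1.96 significance bar."""
    rng = random.Random(seed)
    estimates = (rng.gauss(true_effect, stderr) for _ in range(n_studies))
    significant = [e for e in estimates if abs(e) / stderr > 1.96]
    return sum(significant) / len(significant)

mean_significant_estimate()   # several times larger than the true effect of 0.2
```

The only studies that clear the bar are the ones that (by luck) overshot, so the published average is inflated; the replication then reports "well, only a little."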
[+] [-] Fomite|10 years ago|reply
I once obtained a p-value of zero (or more accurately, smaller than the numerical precision of a p-value in R) for a result that was, by design, meaningless.
It's a bad idea to use p-values as thresholds, regardless of where you put the threshold.