top | item 13025496

Statistical Mistakes and How to Avoid Them

292 points | ingve | 9 years ago | cs.cornell.edu | reply

76 comments

[+] lisper|9 years ago|reply
This is the insight that made statistics "click" for me many years ago: a statistical test answers one central question: what are the odds that the results you observed could have arisen by chance? If those odds are low, then you are justified in concluding that the results probably did not arise by chance, and so there must be some other explanation (usually, but not always, the causal hypothesis you are advancing).

One consequence of this is that it is crucial that you advance your hypothesis before you collect (or at least look at) the data, because the odds of something arising by chance change depending on whether you predict or postdict the results. Also, the more data you have, the more likely you are to find something in there that looks like a signal but is in fact just a coincidence. Many a day-trading fortune has been lost to this one mistake.
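A quick stdlib-Python sketch of that day-trading pitfall (all numbers invented): scan enough coin-flip "strategies" after the fact and one of them will look like a signal.

```python
import random

random.seed(0)

# 1000 "trading strategies" that are really just coin flips:
# each makes 20 trades, winning each with probability 0.5.
n_strategies, n_trades = 1000, 20
best = max(sum(random.random() < 0.5 for _ in range(n_trades))
           for _ in range(n_strategies))

# A single *pre-specified* strategy wins 15+ of its 20 trades with
# probability ~2%, but cherry-picking the best of 1000 after the
# fact almost guarantees finding one at least that good.
print(best)  # well above the 10 wins a fair coin averages
```

Predicting which strategy will win ahead of time is roughly a 1-in-50 event; "discovering" it in the same data afterwards is nearly certain.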

[+] fela|9 years ago|reply
Unfortunately no. Very much no, even though it's widely believed that that is a good definition/intuition (and used in many places).

It's the odds of having that result due to chance, if the null hypothesis is true[0]. That latter part might sound pedantic, but the whole point is that we don't know how likely the null hypothesis is. If I test whether the sun has just died[1] and get a p-value of 0.01, it's still very likely that this result is due to chance (surely more than 1%)! We need a prior probability (i.e. Bayesian statistics) to calculate the probability that the result was due to chance, which is why that partial definition is incomplete and actually very misleading. This point is subtle, but very important to really understand p-values.

Another way to look at it is: if we knew the probability that the result was due to chance we could also just take 1-p and have the probability of there actually being some effect, a probability that hypothesis testing cannot give us.

There is one nice property that hypothesis testing does have (and presumably why it's so widely used): if the idea you are testing is wrong (which actually means "null hypothesis true") you will most likely (1-p) not find any positive results. This is good: it means that if the sun in fact did not die, and you use 0.01 as your threshold, 99% of the experiments will conclude that there is no reason to believe the sun has died. So hypothesis testing does limit the number of false positive findings. The xkcd comic is a bit misleading in this regard: yes, it does highlight the limitations of frequentist hypothesis testing, but the scenario depicted is a very unlikely one; in 99% of the cases there would have been a boring and reasonable "No, the sun hasn't died".
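To put numbers on the xkcd scenario: the detector lies only when two dice come up sixes (probability 1/36, i.e. "p < 0.05"), yet the posterior probability that the sun actually died stays tiny. A stdlib-Python sketch, where the prior is an invented illustrative guess, not a measured quantity:

```python
# Bayes' rule for xkcd 1132: the detector says "yes, the sun exploded".
# It lies iff two dice both come up six, i.e. with probability 1/36.
prior_died = 1e-6                # illustrative prior, assumed here
p_yes_given_died = 35 / 36       # detector tells the truth
p_yes_given_alive = 1 / 36       # detector lies

posterior = (prior_died * p_yes_given_died) / (
    prior_died * p_yes_given_died + (1 - prior_died) * p_yes_given_alive
)
print(posterior)  # ~3.5e-5: the "significant" reading is almost surely chance
```

Despite a result that clears p < 0.05, the probability the sun actually died moves from one in a million to only a few in a hundred thousand.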

For an incredibly interesting article about the difficulty of concluding anything definitive from scientific results I highly recommend "The Control Group is out of Control" at slatestarcodex[2].

[0] To be even more pedantic you would have to add "equal or more extreme", and "under a given model", but "if the null hypothesis is true" is by far the most important piece often missing.

[1] https://xkcd.com/1132/

[2] http://slatestarcodex.com/2014/04/28/the-control-group-is-ou...

[+] stdbrouw|9 years ago|reply
> a statistical test answers one central question: what are the odds that the results you observed could have arisen by chance

Well, no, that'd be very interesting, but unfortunately what a statistical test really gives you is the probability of the results you observed (or more extreme) given chance: P(data|model), not P(model|data).

[+] skybrian|9 years ago|reply
I'm not a statistician, but even so I think this article makes assumptions that may not hold up for computer science. The first thing to do is plot your data. If it doesn't look like a bell curve, it's unlikely that common statistical calculations (which assume something close to Gaussian) apply.

If you're doing benchmarking, another common model is a peak at a minimum value (when everything goes right) and a long tail, due to various events like cache misses that always slow things down, but don't happen in every test run.

On a system with multiple programs running (a typical desktop), taking the mean is meaningless - this just adds noise due to activity unrelated to your program. You'd be better off taking the minimum, which with enough test runs should capture all the events that happen every time and none of the events that don't.

The median or 95th percentile might also be useful if you're investigating events that don't happen every time. But if you want to know about cold start performance (for example), maybe the best thing to do would be to flush your caches before every test run, so the events you're interested in are events that happen every time.
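Those summary statistics are easy to sketch in stdlib Python. The latency model below is made up for illustration: a fixed 10 ms floor plus occasional cache-miss-style delays that hit only some runs.

```python
import random
import statistics

random.seed(1)

# Fake benchmark: 10 ms of work that happens every run, plus a
# 20%-probability extra delay of 1-5 ms (cache-miss-style events).
runs = [10.0 + (random.random() < 0.2) * random.uniform(1, 5)
        for _ in range(1000)]

runs_sorted = sorted(runs)
minimum = runs_sorted[0]                         # events that happen every run
median = statistics.median(runs_sorted)          # the typical run
p95 = runs_sorted[int(0.95 * len(runs_sorted))]  # tail behaviour
mean = statistics.mean(runs_sorted)              # pulled upward by the tail

print(minimum, median, p95, mean)
```

The minimum recovers the 10 ms floor exactly, the median is barely affected by the tail, and the mean sits somewhere in between, mixing the two regimes together.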

[+] srean|9 years ago|reply
> If it doesn't look like a bell curve, it's unlikely that common statistical calculations (which assume something close to gaussian) apply here.

The key word in there is common. There is an entire industry of statistical techniques that do not require Gaussian assumption or for that matter any parametric assumption.

I strongly feel it is time to retire the Gaussian distribution from the space it occupies. Discovering and studying the Gaussian distribution and the bog-standard central limit theorem should be considered one of mankind's crowning achievements. They deserve to be put on a pedestal to appreciate their elegance, but when rubber meets the road one has to open one's mind and look beyond. Appearance of the Gaussian distribution is rarely as normal as many expect/claim it to be (I blame the stats education machinery for this), nor was it invented by Gauss. In fact Gauss used it as a post-hoc justification for backing the least-squares method. His original motivation for least squares was simplicity and convenience, not the normal distribution or the CLT or, for that matter, the Gauss-Markov theorem.

[+] fela|9 years ago|reply
"it’s telling you that there’s at most an alpha chance that the difference arose from random chance. In 95 out of 100 parallel universes, your paper found a difference that actually exists. I’d take that bet."

This is wrong. It's telling you that there's at most an alpha chance that a difference like that (or larger) would have arisen from random chance if the quantities are actually equal. And if the quantities are equal, in 95 out of 100 parallel universes you would not be able to reject the null hypothesis.

Is he saying that he would take the xkcd bet[0] on the frequentist side?

[0] https://xkcd.com/1132/

[+] frozenport|9 years ago|reply
The t-test assumes a normal distribution, which is rarely true, especially when the number of runs is under 100. A better test is the Mann-Whitney U test, which is applicable to a wider class of distributions.
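The U statistic itself is simple enough to sketch in pure Python: over all cross-sample pairs, count how often one sample's value beats the other's. The run times below are invented; real code should use scipy.stats.mannwhitneyu, which also computes the p-value.

```python
# Minimal Mann-Whitney U statistic (illustrative sketch only).
def mann_whitney_u(xs, ys):
    # U counts, over all (x, y) pairs, how often x beats y;
    # ties contribute 1/2.
    u = 0.0
    for x in xs:
        for y in ys:
            if x > y:
                u += 1.0
            elif x == y:
                u += 0.5
    return u

fast = [10.1, 10.3, 9.9, 10.2]   # hypothetical run times, program A
slow = [11.0, 11.4, 10.9, 11.2]  # hypothetical run times, program B

print(mann_whitney_u(slow, fast))  # 16.0: every "slow" run beats every "fast" run
```

Because U depends only on the rank ordering, not the raw values, it is unmoved by outliers and makes no normality assumption.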
[+] rcthompson|9 years ago|reply
I think the t-test is conceptually easier to understand, which is important since the target audience for this article is people who know next to nothing about statistics.

The t-test might not be the best test for every situation, but if the alternative is no test at all, I'll take it.

[+] platz|9 years ago|reply
Central limit theorem means there are lots of cases where normal distributions are directly applicable.
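A quick stdlib-Python illustration of that point: individual exponential draws are heavily skewed, yet their sample means are approximately normal, with spread shrinking like 1/sqrt(n). The sample sizes are arbitrary choices for the demo.

```python
import random
import statistics

random.seed(2)

# Means of n=50 skewed exponential draws (true mean 1, true sd 1):
# by the CLT these means are roughly normal with sd 1/sqrt(50).
n, trials = 50, 2000
sample_means = [statistics.mean(random.expovariate(1.0) for _ in range(n))
                for _ in range(trials)]

print(statistics.mean(sample_means))   # close to the true mean, 1.0
print(statistics.stdev(sample_means))  # close to 1/sqrt(50) ~ 0.141
```

The caveat from the surrounding comments still applies: the CLT blesses the *mean* of many draws, not the raw data, and says nothing about tails or small samples.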
[+] amelius|9 years ago|reply
I don't like how the article tries to push statistics on the reader. If a CS paper compares a pair of averages, then that gives certain information. If statistics can add to that, and make the results a little more precise, then that is nice. But by no means is it absolutely necessary. And statistics will not give a conclusive result either.

I think that authors should use statistics when they see fit, and when it does not distract too much from the original subject of the paper.

[+] samps|9 years ago|reply
Needless to say, I disagree. It can be straight-up misleading to report means without including a more nuanced view of the distribution. You don't need to use a bunch of fancy statistics, but you do need to consider whether your results could have arisen by random chance. That's not a distraction; it's accurately reporting what you found.

Here's one frightening example of spurious performance results in CS: https://www.cis.upenn.edu/~cis501/papers/producing-wrong-dat...

[+] rcthompson|9 years ago|reply
Any time you compare two averages, you are doing statistics, whether or not you report the result in statistical terms. It's not something optional that you can "add on" to provide extra information. If you don't provide some measure of significance, I'm not going to trust that your result has any chance of being real. At best, you don't really know whether the result is real because you ignored the statistics; at worst, you ran the statistics and know it's not real, and you're hoping I won't notice.
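To make "comparing two averages is already statistics" concrete, here is Welch's t statistic (one common flavour of the t-test) in stdlib Python, applied to invented run times whose means differ by about 2%:

```python
import math
import statistics

# Welch's t statistic: the quantity implicitly behind any
# "A is faster than B on average" claim.
def welch_t(a, b):
    va, vb = statistics.variance(a), statistics.variance(b)
    return (statistics.mean(a) - statistics.mean(b)) / math.sqrt(
        va / len(a) + vb / len(b))

# Hypothetical run times: the means differ, but the spread within
# each sample is of the same order as the difference.
a = [10.0, 10.4, 9.8, 10.3, 10.1]
b = [10.2, 10.6, 10.0, 10.5, 10.3]

print(welch_t(a, b))  # small |t| => the difference could easily be noise
```

Reporting only "10.12 vs 10.32" hides exactly the quantity that decides whether the comparison means anything.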
[+] esfandia|9 years ago|reply
IMO it's better to just plot the two distributions that you want to compare so that people can eyeball the difference (or run the statistical tests themselves should they wish to do so). For one thing not all distributions are Gaussian. And the t-test only answers one specific question (i.e., with one given p-value). Then there's people who misinterpret the result of the t-test. Or people who mess with the p-value till they get what they need.
[+] ska|9 years ago|reply
This thinking is part of the problem. While an individual average may just be "a fact", you cannot meaningfully compare two averages without knowing more than their values.

Pretending you can has led to a lot of muddled thinking.

[+] BeetleB|9 years ago|reply
Eh. Let's see how this goes.

Profession A has a mean salary 20% higher than that of Profession B.

Yet people who are in profession A are much more likely to be in poverty than in profession B.

Yet almost any time someone compares two means, they never seem to come to this conclusion - or even consider it a possibility.

Comparing two means without other details is rarely illuminating, and often leads to wrong conclusions (which are worse than no conclusions with no data).
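A toy example of how that can happen (all salaries invented, in $k): a few stars drag profession A's mean above B's even though most of A sits below the poverty line.

```python
import statistics

a = [20, 20, 20, 20, 500]   # profession A: many low earners, one star
b = [90, 95, 100, 105, 110] # profession B: uniformly middling

print(statistics.mean(a), statistics.mean(b))      # A's mean is higher
print(sum(s < 30 for s in a), sum(s < 30 for s in b))  # yet 4/5 of A are below a 30k line
```

The mean alone points one way while nearly the whole distribution points the other, which is exactly the trap described above.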

[+] glangdale|9 years ago|reply
The idea that you should 'plot the error bars' ahead of, well, looking at the data seems a bit premature. As many other comments have stated, looking at the data first is critical.

It drives me up the wall: we have 1200dpi printers, retina displays, and so on, and yet somehow people feel the need to collapse everything they've done to these giant finger-painting quality bar charts. Statistical tests are well and good, but I'm amazed at the extent to which smart people will happily plug data which they have never actually seen into statistical metrics. So a mean might be derived from 9 reasonable results and a howlingly off factor-of-2 outlier, and you can dutifully plug this series into a bunch of standard tests and speak confidently about p-values.
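For the concrete arithmetic (invented numbers): one factor-of-2 outlier among ten runs shifts the mean by about 10%, while the median barely moves, and neither effect is visible if you never plot the raw data.

```python
import statistics

# Nine reasonable measurements and one howlingly-off factor-of-2
# outlier, as in the scenario above.
times = [1.00, 1.02, 0.98, 1.01, 0.99, 1.03, 0.97, 1.00, 1.02, 2.05]

print(statistics.mean(times))    # ~1.11: dragged ~10% off by one run
print(statistics.median(times))  # ~1.00: barely notices the outlier
```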

[+] ekianjo|9 years ago|reply
That's a good article but pretty short. There would be a lot more ground to cover.
[+] wodenokoto|9 years ago|reply
Is there a good resource to learn the underpinnings of P values and T-tests?

I feel like everybody says these are important, show a formula and then arguments ensue about what p=0.95 means, and nobody seems to know this.

[+] imh|9 years ago|reply
I think any intro stats book should do the trick. As far as I know, the material in a first stats course is pretty homogeneous. I'm not a biostatistician, but I happen to like this book [0] for introductory stuff. Amazon says you can get it used for $26.

[0] https://www.amazon.com/Principles-Biostatistics-CD-ROM-Marce...

[+] petters|9 years ago|reply
One common mistake is to take the average of run times (for identical runs). I think taking the minimum should be better under reasonable assumptions.

Edit: I now see that this was mentioned elsewhere here. Good!