Torture your data long enough and it'll tell you anything

129 points | DanielRibeiro | 14 years ago | businessweek.com

34 comments

[+] refurb|14 years ago|reply
As a scientist who has moved into the business world, it amazes me how statistics are abused.

When I was conducting scientific research, the goal was to come up with an air-tight (or as air-tight as possible) case for your hypothesis. If you presented your findings at a meeting, you better be prepared for the onslaught of questions like "Did you consider X?" and "What about Y?".

Then I moved to the business side and holy crap are the standards lower. Of course it's easier to prove something in a lab than in the real world, but in so many cases I've seen somebody say "If you do X, you will get Y result, based on the data I analyzed". Then I raise my hand and say "But what about Z? That could explain your results." and all I get is blank stares like I just solved a differential wave function in my head.

[+] rexf|14 years ago|reply
Google Correlate could be a fun tool to find arbitrary correlation

http://www.google.com/trends/correlate/draw

[+] alexchamberlain|14 years ago|reply
That is a really cool tool. I will definitely be using it to prove some crazy statements.
[+] stfu|14 years ago|reply
Thanks so much for posting this! The perfect tool to irritate people with odd correlations. Reminds me of the good old Church of the Flying Spaghetti Monster "belief" in the correlation between the decline of pirates and global warming.
[+] DanielBMarkham|14 years ago|reply
This subject must be in the air. This article came out, I just blogged on the same general topic (http://www.whattofix.com/blog/archives/2011/12/management-by...), and just a minute ago I got through reading another article on cognitive bias. http://www.american.com/archive/2011/december/the-political-...

It's a great topic, especially for businesses and startups. The article makes its point in a funny way, but I think the problem is much, much deeper than correlation != causation. The basic problem is that we don't understand how to deal with statistics, especially aggregate numbers. The errors in scientific studies, for instance, are just one example of the harm caused by these kinds of cognitive blind spots. (I say blind spot instead of lack of math education because I don't believe the root problem is a lack of understanding math. In my opinion, something else is going on.)

[+] mwexler|14 years ago|reply
All these comments are true. But let's not make the similar error of saying that correlations are bad.

Sometimes, knowing what caused something is the necessary answer, and for those, a root cause analysis and proper experimental design for validation are important. But sometimes, in business and in life, just knowing that things hang together can be pretty handy.

Correlations are important clues. The entire "recommendation" world, from Amazon's collaborative filtering to Hunch's "everything you might be interested in", is predicated on correlations.

No argument, saying correlation implies causation is bad. But it's just as bad to say "therefore, correlation is bad". DanielBMarkham's article and this BW.com post both show that it comes down to interpretation of what the data says. It's understanding the limitations of what a number, or a trend, or even a distribution can reveal. It's understanding what regression to the mean actually means, or why we consider a distribution "normal"... and that outliers actually can be profitable.

And it's a recognition that with the democratization of big data, it will get worse before it gets better... but it will get better. 40 years ago, no one ever saw the stock market on the news, or had access to its ups and downs every second. We now all have a better understanding of stocks (well, ok, that's a bit of a stretch, but you get my drift), and their dangers. Similarly, as we get used to seeing lots more data, and discovering that if you interpret it wrong, bad things happen... well, I expect more folks to ask that next level of questions. Not all, and not much past that... but it will be a start.

[+] forrestthewoods|14 years ago|reply
"There are three kinds of lies: lies, damned lies, and statistics."

I'd go so far as to say the problem is 100x more complicated than "Correlation != Causation". Given a set of factual statistics it's not terribly difficult to present them in a truthful, reasonable manner that supports any side of a given argument.

[+] klodolph|14 years ago|reply
Well, most people seem to forget that if you are looking for correlations among N variables, you can't compare each pair with the same standards as if you only had 2 variables. (Remember the recent article about neuroscience papers? Same thing.)

So the damned lie of statistics is pretty subtle: you just have to omit the number of variables you actually looked at when you present your data.
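To put rough numbers on this, here is a minimal Python sketch. The function name, the 0.05 significance threshold, and the treatment of the pairwise tests as independent are all assumptions made for illustration, not anything from the article. It counts how many pairs you end up comparing among N variables and how many "significant" correlations you would expect from chance alone.

```python
from math import comb

def pairwise_false_positives(n_variables, alpha=0.05):
    """Rough chance-only expectations when testing every pair of variables.

    Assumes each pairwise test is independent and has false-positive
    rate `alpha` (both simplifications, for illustration only).
    """
    n_pairs = comb(n_variables, 2)          # number of distinct pairs
    expected_hits = n_pairs * alpha         # expected spurious "findings"
    p_at_least_one = 1 - (1 - alpha) ** n_pairs
    bonferroni_alpha = alpha / n_pairs      # per-test threshold after correction
    return {
        "pairs": n_pairs,
        "expected_false_positives": expected_hits,
        "p_at_least_one": p_at_least_one,
        "bonferroni_alpha": bonferroni_alpha,
    }

if __name__ == "__main__":
    for n in (2, 10, 20, 50):
        print(n, pairwise_false_positives(n))
```

The Bonferroni threshold is only one (conservative) way to adjust; the point is that the per-pair standard has to tighten as the number of variables grows.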

[+] Sukotto|14 years ago|reply
I wish schools taught math leading to statistics and probability instead of leading to calculus. I believe that would be much more useful for the average citizen.
[+] dxbydt|14 years ago|reply
> wish schools taught math leading to statistics and probability instead of leading to calculus

This is silly. Every cumulative distribution function is càdlàg, so how can you even teach probability without the notion of right-continuous functions with left limits, which means you have to resort to limits & derivatives => Calc?

Actually, the argument for combining Calc & Stats is very compelling, because there is too much synergy. How can you teach a continuous probability distribution like, say, the Gaussian without teaching how to integrate under the curve for the cumulative distribution function, or obtaining the probability density function via the derivative, or obtaining the variance aka second central moment via the moment generating function, which means you now have to teach at least some Fourier transforms, which again means Calculus? At both UChicago & Stanford, where I learnt all of my probability, calculus was quite intertwined with the teaching of probability. I believe it's the same case in most other schools as well.
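A minimal numerical sketch of the first two of those relationships, in Python, for the standard Gaussian. The function names, integration bounds, and step sizes are arbitrary choices made for illustration: integrating the density recovers the CDF, and differentiating the CDF recovers the density.

```python
from math import erf, exp, pi, sqrt

def gaussian_pdf(x):
    """Density of the standard normal distribution."""
    return exp(-0.5 * x * x) / sqrt(2 * pi)

def gaussian_cdf(x):
    """Closed form of the standard normal CDF via the error function."""
    return 0.5 * (1 + erf(x / sqrt(2)))

def integrate_pdf(upper, lower=-10.0, steps=20000):
    """Trapezoid-rule integral of the density from `lower` to `upper`."""
    h = (upper - lower) / steps
    total = 0.5 * (gaussian_pdf(lower) + gaussian_pdf(upper))
    for i in range(1, steps):
        total += gaussian_pdf(lower + i * h)
    return total * h

def differentiate_cdf(x, h=1e-5):
    """Central-difference derivative of the CDF at x."""
    return (gaussian_cdf(x + h) - gaussian_cdf(x - h)) / (2 * h)

if __name__ == "__main__":
    print(integrate_pdf(1.0), gaussian_cdf(1.0))      # area under the curve vs CDF
    print(differentiate_cdf(0.5), gaussian_pdf(0.5))  # slope of the CDF vs density
```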

Without calc in probability, you can do "lame" stuff like discrete distributions (Binomial, Poisson, etc.), but even there, the key insight is to show how the CDFs of the discrete distributions, which will generally have terribly complicated formulae with giant factorial expressions, can be very nicely approximated by the continuous distributions for large n, small p, etc. (aka continuity correction http://en.wikipedia.org/wiki/Continuity_correction ). So for a large number of coin-flip trials, you use a Normal to approximate the CDF because otherwise the original binomial CDF is too hard to compute with your TI-84s (because you have one giant factorial divided by another giant factorial and the numerical overflows will kill the computation unless you are very careful about how you go about computing the result).
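As a rough illustration of that approximation, here is a small Python sketch. The function names and the example values n = 100, p = 0.5, k = 55 are assumptions picked for illustration. It compares the exact binomial CDF with the Normal approximation evaluated at k + 0.5, which is the continuity correction.

```python
from math import comb, erf, sqrt

def binomial_cdf(k, n, p):
    """Exact P(X <= k) for X ~ Binomial(n, p), summed term by term."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))

def normal_cdf(x, mean, sd):
    """CDF of a Normal(mean, sd) distribution via the error function."""
    return 0.5 * (1 + erf((x - mean) / (sd * sqrt(2))))

def binomial_cdf_normal_approx(k, n, p):
    """Normal approximation to P(X <= k) with the continuity correction."""
    mean = n * p
    sd = sqrt(n * p * (1 - p))
    return normal_cdf(k + 0.5, mean, sd)  # the +0.5 is the correction

if __name__ == "__main__":
    n, p, k = 100, 0.5, 55
    print("exact: ", binomial_cdf(k, n, p))
    print("approx:", binomial_cdf_normal_approx(k, n, p))
```

Python's arbitrary-precision integers make `comb` itself safe, but multiplying a huge binomial coefficient by a float can still overflow for much larger n, which is the calculator problem described above in another guise.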

My favorite go-to guide remains the excellent Calc & Stat Dover book ( http://www.amazon.com/Calculus-Statistics-Dover-Books-Mathem... ), which combines Calc & Stats from page 1. There is simply no better way to learn stats than via calc.

[+] stygianguest|14 years ago|reply
In fact, this is the origin of the term data-mining: to find whatever you need in data. Funny how it changed from a derogatory term to a respected --or at least well-paid-- practice.
[+] CognitiveLens|14 years ago|reply
You are over-simplifying and therefore trivializing what data-mining is. Data-mining is about deriving fact-based conclusions from complex information as an alternative to making decisions based on intuition or ignorance. Like almost anything complex, it can be done very poorly (as in the OP), or it can be done well. That doesn't mean that it originates in mis-representing information for the sake of 'finding whatever you need'.
[+] tatsuke95|14 years ago|reply
This is a massive exaggeration. Of course you can find correlation over specific periods between random series. But when you're doing real analysis, the series you use aren't random (like the shape of a mountain). The idea is to draw an inference first, then see what the associated data says.

Of course, anyone beyond the base level of wisdom in this field understands this. It just annoys me that people attempt to diminish the value of statistics with an argument like this.
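The first point, that correlations turn up between unrelated random series if you look at enough of them, is easy to reproduce. A minimal Python sketch follows; the function names, the choice of Gaussian random walks, and the sizes (200 candidates of 50 points each) are all assumptions made for illustration. It generates one target series and many independent candidates, then reports the candidate that happens to correlate best.

```python
import random
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def random_walk(length, rng):
    """Cumulative sum of independent standard-normal steps."""
    position = 0.0
    walk = []
    for _ in range(length):
        position += rng.gauss(0.0, 1.0)
        walk.append(position)
    return walk

def best_spurious_match(n_candidates=200, length=50, seed=0):
    """Correlate one random walk against many unrelated ones.

    Returns the correlation with the largest magnitude, plus all of them.
    """
    rng = random.Random(seed)
    target = random_walk(length, rng)
    correlations = [
        pearson(target, random_walk(length, rng)) for _ in range(n_candidates)
    ]
    return max(correlations, key=abs), correlations

if __name__ == "__main__":
    best, all_r = best_spurious_match()
    print("strongest correlation found by chance:", best)
```

Picking the series first and the hypothesis afterwards is exactly what this does; stating the inference first, as suggested above, removes the `max` over candidates.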

[+] timwiseman|14 years ago|reply
I don't think they are trying to diminish the value of statistics at all, but rather point out that it is easy to misunderstand or even deliberately abuse them.

This is more a warning to people without an understanding of statistics, because most people out there do not have a deep grasp of the fact that correlation does not imply causation.

[+] CognitiveLens|14 years ago|reply
Articles like this tend to elicit an interesting response from people. On one hand, many seem to believe that statistics = deceptive manipulation. On the other, many call for better statistics education in schools. Sometimes, both claims are made by the same individuals. So it seems that statistics education has at least two goals: explain what statistics actually is, and then explain how to do it correctly.
[+] iqster|14 years ago|reply
Minor footnote: I thought the saying was "if you torture your data long enough, it will CONFESS to anything."
[+] _delirium|14 years ago|reply
Fig. 6 is awesome. Clearly too striking a correlation to be merely coincidence.
[+] wtvanhest|14 years ago|reply
I wasn't sure at first, but the gun illustration cemented it as fact.
[+] skore|14 years ago|reply
Not sure whether you're joking, but since this is HN: That clearly ain't an actual mountain range.
[+] alphamale3000|14 years ago|reply
The core idea is true and should be spread around, but their examples lack refinement and subtlety.