
Same Stats, Different Graphs: Datasets with Varied Appearance and Identical Stats

126 points | bryceroney | 9 years ago | autodeskresearch.com

23 comments

[+] michaelmior|9 years ago|reply
I don't think the title does a great job of identifying how cool this is. Having the animations especially is great!
[+] jmatejka|9 years ago|reply
Thanks! I'm the author of the paper, glad you like it, and especially glad that you think the animations are cool :-)
[+] bryceroney|9 years ago|reply
It was hard to fit a title into 80 characters on my first HN post :)
[+] h1ckb|9 years ago|reply
I agree, very cool.
[+] wyldfire|9 years ago|reply
> Boxplots are commonly used to show the distribution of a dataset, and are better than simply showing the mean or median value. However, here we can see as the distribution of points changes, the box-plot remains the same.

Violin plots [1][2] are a great spin on boxplots that help show the distribution.

[1] https://en.wikipedia.org/wiki/Violin_plot

[2] http://seaborn.pydata.org/generated/seaborn.violinplot.html

[+] SamBam|9 years ago|reply
But a violin plot still wouldn't distinguish between many of the plots in this set. All of the plots that don't have multiple vertical stacks -- so, all the horizontal lines, diagonal lines, circles, and scatter -- will have the same or almost the same violin plot.

You could rotate the violin plot to show how the density changes on the other axis, for those graphs where that would be useful, but that requires looking at the data visually and making decisions, which is the whole point of the article.
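
A minimal sketch of the point above, assuming synthetic data (the values, seed, and the helper name `summary` are illustrative, not taken from the paper): two datasets that share the same y values but pair them with x differently have identical one-dimensional summaries of y, so any box or violin plot drawn from y alone cannot tell them apart.

```python
import random
import statistics

rng = random.Random(1)

# Hypothetical data: 200 y values, paired with x = 0..199 in two different ways.
xs = list(range(200))
ys_scatter = [rng.gauss(50.0, 10.0) for _ in xs]  # random order: a scatter cloud
ys_curve = sorted(ys_scatter)                     # sorted order: a rising curve

def summary(values):
    """Min, quartiles, and max: the numbers a boxplot is drawn from."""
    q1, q2, q3 = statistics.quantiles(values, n=4)
    return (min(values), q1, q2, q3, max(values))

# The y marginal is the same multiset in both cases, so the summaries match,
# even though the (x, y) pictures look nothing alike.
same_summary = summary(ys_scatter) == summary(ys_curve)
print(same_summary)
```

Only a view that uses both axes (a scatter plot, or a summary computed on x as well) separates the two.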

[+] pmoriarty|9 years ago|reply
I'm wondering how the general public can be educated about this when so many people are unaware even of the difference between the median and the mean.

In the mass media especially, the mean is often bandied about as the only statistic, and treated as if it were definitive.

[+] rodionos|9 years ago|reply
I think the public is largely aware of the difference since housing prices are often reported using the median. Intuitively, everyone knows that median housing price is the price of a 'regular' (most frequently sold) house in a given area.
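
A small illustration of the mean/median gap discussed in this sub-thread, using made-up house prices (the numbers are hypothetical, in thousands): one expensive sale pulls the mean far away from what a typical house costs, while the median barely notices.

```python
import statistics

# Hypothetical sale prices in thousands; the last one is a single mansion.
prices = [200, 210, 220, 230, 2500]

mean_price = statistics.mean(prices)      # 672
median_price = statistics.median(prices)  # 220

print(mean_price, median_price)
```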
[+] eggie5|9 years ago|reply
You see this a lot when building a regression, where the assumptions are often violated:

* Linear relationship between predictor and response and no multicollinearity
* No auto-correlation (statistical independence of the errors)
* Homoscedasticity (constant variance) of the errors
* Normality of the residual (error) distribution

As the paper suggests, plotting the data will help you catch violations of these assumptions, but checking them with statistical tests works too. For example, you can look at your residuals as an indicator of good fit. If your residuals do not follow a normal distribution, this is typically a warning sign that your R² score is dubious.

There are a few statistical tests for residual normality; in particular, the Jarque-Bera test is common and available in scipy.
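
A rough sketch of what the Jarque-Bera test computes, written with the standard library only so it runs anywhere. In practice you would likely call `scipy.stats.jarque_bera`; the function below is a hand-rolled stand-in, and the residuals are synthetic (both are assumptions for illustration). The statistic combines sample skewness S and kurtosis K as JB = n/6 * (S^2 + (K - 3)^2 / 4), and under normality it is asymptotically chi-square with 2 degrees of freedom.

```python
import math
import random

def jarque_bera(residuals):
    """Return (JB statistic, asymptotic p-value) for a sequence of residuals."""
    n = len(residuals)
    mean = sum(residuals) / n
    # Central moments with the biased (divide-by-n) convention.
    m2 = sum((r - mean) ** 2 for r in residuals) / n
    m3 = sum((r - mean) ** 3 for r in residuals) / n
    m4 = sum((r - mean) ** 4 for r in residuals) / n
    skew = m3 / m2 ** 1.5
    kurt = m4 / m2 ** 2
    jb = n / 6.0 * (skew ** 2 + (kurt - 3.0) ** 2 / 4.0)
    # The chi-square(2) survival function is exactly exp(-x / 2).
    p_value = math.exp(-jb / 2.0)
    return jb, p_value

rng = random.Random(0)
normal_resid = [rng.gauss(0.0, 1.0) for _ in range(2000)]         # roughly normal
skewed_resid = [rng.expovariate(1.0) - 1.0 for _ in range(2000)]  # strongly skewed

jb_normal, p_normal = jarque_bera(normal_resid)
jb_skewed, p_skewed = jarque_bera(skewed_resid)
print(jb_normal, p_normal)
print(jb_skewed, p_skewed)
```

A very small p-value for the skewed residuals is evidence against normality. A large p-value only means the test found no evidence of non-normality, not that the residuals are normal.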

So, I would argue, you don't even need to visualize the data. I describe this more here: http://www.eggie5.com/104-linear-regression-assumptions

[+] christopheraden|9 years ago|reply
Sure, using hypothesis tests could pick out some of the structured examples in the Datasaurus, but in practice, things are often more subtle. Goodness-of-fit tests to check for normality, in particular, are a little bit thorny: they lack power in small samples and reject normality for slight departures in larger ones. My experience with assumption checking has been that by the time a hypothesis test has sufficient evidence to reject an assumption, you'd usually be able to see it visually.

Until you get into high dimensions, it probably doesn't hurt too much to visualize the data. Additionally, it can be helpful to understand what signal has been left in the residuals (e.g., you fit a linear model, but failed to include a quadratic term), which is something hypothesis tests aren't as good at telling you.
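
A minimal sketch of the "signal left in the residuals" case, assuming noiseless synthetic data generated from y = 1 + 2x + 0.5x^2 (the data and the helper names `fit_line` and `pearson` are illustrative): fit a straight line by least squares, then check what the residuals still correlate with.

```python
import math

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx

def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    va = sum((u - ma) ** 2 for u in a)
    vb = sum((v - mb) ** 2 for v in b)
    return cov / math.sqrt(va * vb)

# Synthetic data with a quadratic term that the linear model omits.
xs = [i / 10 for i in range(-50, 51)]
ys = [1.0 + 2.0 * x + 0.5 * x * x for x in xs]

slope, intercept = fit_line(xs, ys)
residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]

corr_with_x = pearson(residuals, xs)                    # ~0 by construction of OLS
corr_with_x2 = pearson(residuals, [x * x for x in xs])  # close to 1 here
print(slope, corr_with_x, corr_with_x2)
```

The residuals are uncorrelated with x, as least squares guarantees, yet almost perfectly correlated with x^2. A residual-versus-x plot shows that U shape immediately, whereas a normality test on the residuals says nothing about which term is missing.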

[+] crb002|9 years ago|reply
Good use case for n-dimensional probability density functions. Query a region by a hypercube.
[+] squeakynick|9 years ago|reply
Awesome article. I just read it with my morning coffee, and I have a big grin.
[+] jmatejka|9 years ago|reply
Thanks! Glad you liked it :-)