> Boxplots are commonly used to show the distribution of a dataset, and are better than simply showing the mean or median value. However, here we can see that as the distribution of points changes, the boxplot remains the same.
Violin plots [1][2] are a great spin on boxplots that help show the distribution.
But a violin plot still wouldn't distinguish between many of the plots in this set. All of the plots that don't have multiple vertical stacks -- so, all the horizontal lines, diagonal lines, circles, and scatter -- will have the same or almost the same violin plot.
You could rotate the violin plot to show how the density changes on the other axis, for those graphs where that would be useful, but that requires looking at the data visually and making decisions, which is the whole point of the article.
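A quick numpy sketch of the point being made here (the datasets are my own construction, not from the paper): two datasets with very different 2-D structure can share an identical y-marginal, so their vertical boxplots and violin plots are indistinguishable even though their correlation structure differs completely.

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(size=500)

# Dataset A: points on the diagonal x = y (a perfect line).
line_x = y.copy()

# Dataset B: the same y-values with x shuffled (an unstructured scatter).
scatter_x = rng.permutation(y)

# Both datasets have literally identical y-values, so a vertical boxplot
# or violin plot of y looks exactly the same for both -- yet one is a
# perfect line and the other is a shapeless blob.
r_line = np.corrcoef(line_x, y)[0, 1]        # exactly 1: a line
r_scatter = np.corrcoef(scatter_x, y)[0, 1]  # near 0: a blob
assert r_line > 0.999
assert abs(r_scatter) < 0.2
```

Rotating the violin plot, as suggested above, would reveal the x-marginal instead, but neither orientation captures the joint structure.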
I'm wondering how the general public can be educated about this when so many people are unaware even of the difference between the median and the mean.
In the mass media especially, the mean is often bandied about as the only statistic, and treated as if it were definitive.
I think the public is largely aware of the difference, since housing prices are often reported using the median. Intuitively, everyone knows that the median housing price is the price of a 'regular' house in a given area: half sell for less and half for more.
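A tiny illustration with Python's standard library (the prices are made up): a single expensive sale drags the mean well above the median, while the median still reflects a 'regular' house.

```python
import statistics

# Hypothetical housing prices: mostly modest homes plus one mansion.
prices = [200_000, 220_000, 250_000, 260_000, 280_000, 3_000_000]

median = statistics.median(prices)  # 255000.0 -- a 'regular' house
mean = statistics.mean(prices)      # ~701667  -- dragged up by the outlier
assert median == 255_000
assert mean > 2 * median
```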
You see this a lot in the context of building a regression, where the assumptions are often violated:
* Linear relationship between predictor and response and no multicollinearity
* No auto-correlation (statistical independence of the errors)
* Homoscedasticity (constant variance) of the errors
* Normality of the residual (error) distribution.
As the paper suggests, plotting the data visually will help you check these assumptions, but simply verifying that you don't violate them with statistical tests would work too. For example, you can look at your residuals (loss) as an indicator of good fit. If your residuals do not follow a normal distribution, this is typically a warning sign that your R2 score is dubious.
There are a few statistical tests for residual normality; in particular, the Jarque-Bera test is common and available in scipy.
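A minimal sketch with `scipy.stats.jarque_bera` (the data here is synthetic and the model is my own example): fit a line, then test the residuals. With skewed (exponential) errors, the test firmly rejects normality.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=500)

# Regression with skewed (exponential) errors -- a misspecified noise model.
y = 2.0 * x + 1.0 + rng.exponential(scale=1.0, size=500)

# Fit a straight line and compute the residuals.
slope, intercept = np.polyfit(x, y, 1)
residuals = y - (slope * x + intercept)

# Jarque-Bera tests normality via the skewness and kurtosis of the sample.
jb_stat, p_value = stats.jarque_bera(residuals)
assert p_value < 0.01  # strong evidence the residuals are not normal
```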
Sure, using hypothesis tests could pick out some of the structured examples in the Datasaurus, but in practice, things are often more subtle. Goodness-of-fit tests to check for normality, in particular, are a bit thorny: they lack power at small sample sizes and reject normality for slight departures at large sample sizes. My experience with assumption checking has been that by the time a hypothesis test has sufficient evidence to reject an assumption, you'd usually be able to see it visually.
Until you get into high dimensions, it probably doesn't hurt too much to visualize the data. Additionally, it can be helpful to understand what signal has been left in the residuals (e.g., you fit a linear model but failed to include a quadratic term), which is something hypothesis tests aren't as good at telling you.
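A small sketch of that residual-signal point (synthetic data, assumptions mine): fit a straight line to data that is actually quadratic. A residuals-vs-x plot would show an obvious U-shape, which shows up numerically as a strong correlation between the residuals and the omitted quadratic term.

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(-3, 3, 200)
y = x**2 + rng.normal(0, 0.5, size=200)  # true relationship is quadratic

# Fit a (misspecified) straight line.
slope, intercept = np.polyfit(x, y, 1)
residuals = y - (slope * x + intercept)

# Plotting residuals against x would reveal a clear U-shape; numerically,
# the residuals are strongly correlated with the omitted x^2 term.
r = np.corrcoef(residuals, x**2)[0, 1]
assert r > 0.9
```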
michaelmior | 9 years ago
jmatejka | 9 years ago
bryceroney | 9 years ago
h1ckb | 9 years ago
wyldfire | 9 years ago
[1] https://en.wikipedia.org/wiki/Violin_plot
[2] http://seaborn.pydata.org/generated/seaborn.violinplot.html
SamBam | 9 years ago
pmoriarty | 9 years ago
rodionos | 9 years ago
eggie5 | 9 years ago
So, I would argue, you don't even need to visualize the data. I describe this more here: http://www.eggie5.com/104-linear-regression-assumptions
christopheraden | 9 years ago
murkle | 9 years ago
tjwei | 9 years ago
I feel it might be a useful example for illustrating the intuition behind ZCA and the Wasserstein metric.
http://nbviewer.jupyter.org/github/tjwei/Animation-with-Iden...
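For context, a minimal numpy sketch of ZCA whitening (my own example, not taken from the linked notebook): whitening maps any centered point cloud to identity covariance, which is why point sets with matched second moments can be smoothly morphed into one another while keeping those summary statistics fixed.

```python
import numpy as np

rng = np.random.default_rng(0)
# A correlated 2-D point cloud (rows are points).
X = rng.normal(size=(1000, 2)) @ np.array([[2.0, 1.0], [0.0, 1.0]])
Xc = X - X.mean(axis=0)

# ZCA whitening: W = C^(-1/2), built from the eigendecomposition of the
# sample covariance C.
C = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(C)
W = eigvecs @ np.diag(eigvals ** -0.5) @ eigvecs.T
Xw = Xc @ W

# The whitened cloud has identity covariance, regardless of the shape of
# the original cloud.
assert np.allclose(np.cov(Xw, rowvar=False), np.eye(2), atol=1e-8)
```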
crb002 | 9 years ago
squeakynick | 9 years ago
jmatejka | 9 years ago