top | item 40771513

(no title)

For what it's worth, you've convinced me that my beloved box plots need to be explained if I want to use them again.

The SVG you've provided clearly shows that the box plot splits the data in 4. The interquartile range (IQR) is clearly marked and it even has a comparison for what the standard deviation (variance) measure would be.

Secondly, if the data truly came from a normal distribution, there are no outliers. Outliers are data points which cannot be explained by the model and need to be removed. Unless you have a good reason to exclude the data points they should be included. This is why I like the IQR and the median, they are not swayed by a few wide valued data points. The 1.5*IQR rejection filter I think is lazy and unjustified. Happy to discuss this point further as it is a bug bear of mine.

discuss

blueflow|1 year ago

When i said "splitting", i meant it like my parent explained: Basically sorting your datasets and then splitting into quarters.

What you want to explain to me (IMHO to the wrong person) is the correct approach of calculating a mean and standard deviation and drawing the box from that. Lets stay with that (and thats what i said earlier in the thread)

After i wrote the post you replied to, i realized that the pure "splitting" method for box plots is nonsensical since the outer brackets interval is determined by the two most extreme values. They are too random to be meaningful. It does not make sense to draw a box plot from that.

ColFrancis|1 year ago

The quartiles are defined by doing the sorting and splitting algorithm. So if you want quartiles (or any other quantile generally) you need to calculate it that way. The mean and standard deviation (sigma) are fundamentally different, which is why the image you linked shows them to contrast against the quantiles.

If you want to represent the standard deviation with your box plot, you can calculate it using standard formulas, many maths libraries have them built in. I don't know how to plot it using any graphing package though. ggplot, plotly and matlab all use the quantiles (the ones I have experience with). Perhaps where ever you learned to read them as mean and standard devation has a reference you could use?

> They are too random to be meaningful. It does not make sense to draw a box plot from that.

This can be a problem. In practice, the distributions I see don't go too crazy and are bounded (production rates can't be negative and can't be infinite). I prefer to use the 10th and 90th percentiles which are well defined and better behaved for most distributions. I do make sure it's very clearly marked on each plot though as it's not standard. Using the 1.5 x IQR cutoff is no better though as when you have enough samples you find that the whiskers just travel out to the cutoff.