
What's so hard about histograms?

152 points| robertkrahn01 | 8 years ago |tinlizzie.org

18 comments

[+] lukego|8 years ago|reply
What a beautiful presentation!

Tangentially: I am really enjoying the book "All of Statistics" as a reference for better understanding things like histograms, kernel density functions, etc., and their parameters.

https://www.amazon.com/All-Statistics-Statistical-Inference-...

[+] vanderZwan|8 years ago|reply
If you're interested in histograms, I highly recommend "Expressing complex data aggregations with Histogrammar" by Jim Pivarski, where he talks about how for decades histograms have been used in unique ways to do amazing things in high energy physics (HEP, arguably the original Big Data field in compsci):

https://www.youtube.com/watch?v=mB4Chl0ly-g

One point made early on is that HEP treats a histogram as a kind of accumulator that data can be streamed into (because the amount of data processed was typically too big to load into RAM all at once), instead of as a chart. From that starting point you can add, divide, and multiply histograms with histograms to build crazy things.

The results are no longer really histograms of course, but it's fun to see how something that we just think of as a chart can be (ab)used like that.
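
To make the accumulator idea concrete, here is a minimal Python sketch. It is not Histogrammar's API: the class `StreamingHistogram` and its methods are hypothetical names made up for illustration, and the data is invented. It only shows filling from a stream one value at a time, then adding and dividing histograms bin by bin.

```python
from typing import Iterable, List, Optional


class StreamingHistogram:
    """Hypothetical fixed-width histogram over [low, high), filled one value at a time."""

    def __init__(self, low: float, high: float, nbins: int,
                 counts: Optional[List[float]] = None) -> None:
        self.low, self.high, self.nbins = low, high, nbins
        self.counts = list(counts) if counts is not None else [0.0] * nbins

    def fill(self, x: float, weight: float = 1.0) -> None:
        """Add one observation; values outside [low, high) are simply ignored here."""
        if self.low <= x < self.high:
            i = int((x - self.low) / (self.high - self.low) * self.nbins)
            self.counts[min(i, self.nbins - 1)] += weight

    def fill_from(self, stream: Iterable[float]) -> "StreamingHistogram":
        """Consume an iterable lazily, so the data never has to fit in RAM at once."""
        for x in stream:
            self.fill(x)
        return self

    def _check_same_binning(self, other: "StreamingHistogram") -> None:
        if (self.low, self.high, self.nbins) != (other.low, other.high, other.nbins):
            raise ValueError("histograms must share the same binning")

    def __add__(self, other: "StreamingHistogram") -> "StreamingHistogram":
        """Bin-by-bin sum, e.g. to merge partial results from separate chunks."""
        self._check_same_binning(other)
        summed = [a + b for a, b in zip(self.counts, other.counts)]
        return StreamingHistogram(self.low, self.high, self.nbins, summed)

    def __truediv__(self, other: "StreamingHistogram") -> List[Optional[float]]:
        """Bin-by-bin ratio; None where the denominator bin is empty."""
        self._check_same_binning(other)
        return [a / b if b else None for a, b in zip(self.counts, other.counts)]


# Two chunks of made-up data, each consumed as a stream.
a = StreamingHistogram(0.0, 1.0, 4).fill_from(iter([0.1, 0.3, 0.6]))
b = StreamingHistogram(0.0, 1.0, 4).fill_from(iter([0.1, 0.9, 1.5]))

total = a + b        # merged histogram of both chunks
ratio = a / total    # per-bin fraction contributed by chunk a
print(total.counts)
print(ratio)
```

As noted, the ratio is no longer a histogram in the strict sense, which is why this sketch returns a plain list for it.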

[+] jtxx000|8 years ago|reply
Kernel density plots should be preferred to histograms in nearly all cases. Histograms can be seen as a kernel density plot with a uniform kernel that has been sampled. Since a kernel density plot with a uniform kernel has unbounded frequency content, this sampling introduces aliasing, which is why you get all of these strange effects when adjusting the bin width and offset. In fact, if the distribution of your data happens to be a sine wave, then the histogram will also be a sine wave, but, due to aliasing, it may have a different frequency and phase.
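
A small Python sketch of the bin-offset effect, using made-up data and a hypothetical helper `histogram`. The same six values are binned twice with the same width and only the offset changed, and the two results suggest different shapes.

```python
import math
from collections import Counter


def histogram(data, width, offset=0.0):
    """Count values per bin; bin k covers [offset + k*width, offset + (k+1)*width)."""
    counts = Counter(math.floor((x - offset) / width) for x in data)
    return dict(sorted(counts.items()))


# Made-up data: pairs of values sitting just either side of 1, 2 and 3.
data = [0.9, 1.1, 1.9, 2.1, 2.9, 3.1]

aligned = histogram(data, width=1.0, offset=0.0)   # bin edges on the integers
shifted = histogram(data, width=1.0, offset=0.5)   # bin edges on the half-integers

print(aligned)   # four bins, the pairs are split across edges
print(shifted)   # three bins, each pair lands in one bin
```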

For a kernel density plot with a Gaussian kernel, the kernel size does affect the result, but the situation is much better than with histograms for two reasons:

1. The kernel density plot varies smoothly as the kernel size changes, and so there is greater confidence that you have seen the whole story by only looking at a few kernel sizes.

2. You can construct a kernel density plot with a larger kernel given only a kernel density plot with a smaller kernel. Since the convolution of two Gaussians produces a new Gaussian with a variance equal to the sum of the input variances, you only have to convolve the small-kernel plot with another Gaussian to produce the large-kernel plot. This, again, means that you have more confidence that you've seen the whole story by looking at only a few kernel sizes.

As a side note, there is technically a 1:1 relationship between 1D datasets and kernel density plots with a Gaussian kernel, and so in theory you don't lose any information by constructing the kernel density plot. In practice, however, you do lose information due to limited precision.
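
Point 2 can be checked numerically. The following is a rough Python sketch with made-up data and hypothetical helper names (`gaussian`, `kde`, `smoothed_at`); the convolution is approximated by a Riemann sum on a fixed grid, so the two results agree only up to a small numerical error.

```python
import math


def gaussian(x, sigma):
    """Density of a zero-mean normal distribution with standard deviation sigma."""
    return math.exp(-0.5 * (x / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))


def kde(data, sigma, x):
    """Gaussian kernel density estimate of `data` evaluated at x."""
    return sum(gaussian(x - xi, sigma) for xi in data) / len(data)


data = [1.0, 1.5, 4.0]                    # made-up sample
s_small, s_extra = 0.3, 0.4               # kernel sizes (standard deviations)
s_large = math.hypot(s_small, s_extra)    # variances add: 0.3^2 + 0.4^2 = 0.5^2

# Sample the small-kernel plot on a grid wide enough to cover its tails.
step = 0.02
grid = [-4.0 + i * step for i in range(651)]
small = [kde(data, s_small, g) for g in grid]


def smoothed_at(x):
    """Convolve the sampled small-kernel plot with a Gaussian of size s_extra."""
    return sum(f * gaussian(x - g, s_extra) for f, g in zip(small, grid)) * step


for x in (1.0, 2.5, 4.0):
    direct = kde(data, s_large, x)        # large-kernel plot computed from the data
    derived = smoothed_at(x)              # large-kernel plot derived from the small one
    print(x, direct, derived)
```

Note that the derived values never touch `data` directly; they are computed from the small-kernel plot alone.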

[+] svara|8 years ago|reply
When you think you want to plot a histogram, it's often a better idea to plot an (empirical) cumulative distribution [0] instead. You don't have to worry about how to select your bin limits and you can usually put several in the same plot for comparison without making it unreadable due to overlap.

[0] https://en.wikipedia.org/wiki/Empirical_distribution_functio....
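
For reference, an empirical cumulative distribution needs no bin parameters at all. A minimal Python sketch, where the helper name `ecdf` and the sample values are made up for illustration:

```python
from bisect import bisect_right


def ecdf(data):
    """Return F, where F(x) is the fraction of observations that are <= x."""
    xs = sorted(data)
    n = len(xs)

    def F(x):
        # bisect_right counts how many sorted values are <= x.
        return bisect_right(xs, x) / n

    return F


F = ecdf([3.0, 1.0, 2.0, 2.0])   # made-up sample
for x in (0.5, 1.0, 2.0, 3.0):
    print(x, F(x))
```

To plot it, draw a step function through the sorted values; several samples can share one set of axes because each is a single monotone curve.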

[+] aji|8 years ago|reply
I like using cumulative distributions because they make small changes in the data a little more obvious. For example, if all the buckets are 10 but there's a section where they're 11, that difference will show up in a cumulative distribution as a bend in an otherwise straight line, which in my opinion is a much easier difference to see.
[+] wodenokoto|8 years ago|reply
Is there a way to read this decently on mobile?

I've tried Firefox reading mode as well as pocket but they both cut off large parts of the text.

[+] acbart|8 years ago|reply
In my introductory programming class, we teach a few basic forms of chart visualization. By far, students struggle the most with histograms. Even more frustrating, they love line plots and attempt to use them everywhere, despite my explanations that you can almost always use histograms and almost never use line plots! Yet they go with what they find more intuitive...
[+] pletnes|8 years ago|reply
Depends on what you're doing, I'd say. In physics-based modeling, which is used more in e.g. engineering, line plots are often very useful. When examining noisy real-world data, not so much.
[+] SeanLuke|8 years ago|reply
Unfortunate that they're talking about distributions and yet the very first example they use ("The paintings of Bob Ross") isn't a distribution.
[+] RodericDay|8 years ago|reply
> We notice that you're not using the Google Chrome browser. You're welcome to try continuing—but if some parts of the essay are rendering or behaving strangely, please try Chrome instead.

what a world

[+] shdon|8 years ago|reply
20 years ago we had the "this site works best in Internet Explorer" buttons on way too many sites. Plus ça change...

That said, the site works just fine in Firefox, Edge and even IE11 too. So, if anything, the message is a sign of sloppiness in not even bothering to check.