Histograms for Probability Density Estimation: A Primer

27 points| vvanirudh | 1 year ago |vvanirudh.github.io

12 comments

bagrow|1 year ago

The best way to compute the empirical CDF (ECDF) is by sorting the data:

    import numpy as np
    import matplotlib.pyplot as plt

    N = len(data)
    X = sorted(data)
    Y = np.arange(1, N + 1) / N  # ECDF jumps to i/N at the i-th sorted sample
    plt.plot(X, Y)
Technically, you should plot this with `plt.step`.

andrewla|1 year ago

scipy even has a built-in method (scipy.stats.ecdf) for doing exactly this.
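A minimal sketch of that built-in, assuming SciPy >= 1.11 (where `scipy.stats.ecdf` was added); the normal sample is just illustrative data:

    import numpy as np
    from scipy import stats  # scipy.stats.ecdf requires SciPy >= 1.11

    rng = np.random.default_rng(0)
    data = rng.normal(size=500)

    res = stats.ecdf(data)        # empirical CDF fitted from the raw samples
    res.cdf.evaluate(0.0)         # P(X <= 0), roughly 0.5 for standard-normal data

    # res.cdf.quantiles and res.cdf.probabilities hold the step-function
    # breakpoints, convenient for plt.step(res.cdf.quantiles, res.cdf.probabilities)
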

sobriquet9|1 year ago

Why estimate the PDF through a histogram and then convert to a CDF, when one can estimate the CDF directly? Doing so also avoids having to choose a bin width, which can have a substantial impact on the result.

andrewla|1 year ago

Agreed -- very odd to use a tuning parameter (bin width) in a nonparametric estimate. Just use the raw data. In numerical analysis, broadly speaking, integration is stable while differentiation is wild; the empirical CDF is a stable integral of the messy PDF.

Bostonian|1 year ago

If the data is continuous, use kernel density estimation (KDE) instead of histograms to visualize the probability density, since KDE gives a smoother fit. A similar idea is to fit a mixture of normals -- there are numerous R packages for this, and sklearn.mixture.GaussianMixture in scikit-learn.
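A quick sketch of the KDE route using `scipy.stats.gaussian_kde` (one common implementation; the sample and grid range are just illustrative assumptions):

    import numpy as np
    from scipy.stats import gaussian_kde

    rng = np.random.default_rng(1)
    data = rng.normal(size=1_000)

    kde = gaussian_kde(data)            # bandwidth picked by Scott's rule by default
    grid = np.linspace(-5, 5, 200)
    density = kde(grid)                 # smooth density estimate on the grid

    # over a wide enough grid, the estimate should integrate to ~1
    np.trapz(density, grid)
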

vvanirudh|1 year ago

Yep! The next post will be on kernel density estimation -- I wanted to start from histograms since they are still a useful tool for 1-D and 2-D density estimation, and you don't have to store the raw data either (unlike with KDE).
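That storage point can be sketched: with fixed bin edges chosen up front (an assumption -- here an arbitrary grid on [-4, 4]), a histogram density can be accumulated from a data stream while keeping only the bin counts, never the samples themselves:

    import numpy as np

    rng = np.random.default_rng(0)
    edges = np.linspace(-4, 4, 41)        # fixed bin edges, chosen in advance
    counts = np.zeros(len(edges) - 1)

    # stream the data in chunks; only `counts` is retained between chunks
    for _ in range(100):
        chunk = rng.normal(size=1_000)
        counts += np.histogram(chunk, bins=edges)[0]

    # normalize counts to a density (samples outside the edges are dropped)
    widths = np.diff(edges)
    density = counts / (counts.sum() * widths)

By construction `density` integrates to 1 over the binned range, which is exactly what `np.histogram(..., density=True)` would give if all the data were held in memory at once.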