3 Comments
Oct 17, 2023 · Liked by Avi Chawla

Your point about histograms is entirely accurate, but I find it odd to recommend a KDE plot as an alternative as it has exactly the same issue except with “smoothing bandwidth” instead of “bin width”.
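To make the bandwidth point concrete, here is a minimal sketch, assuming a Gaussian kernel and a small made-up sample (neither comes from the article). The helper names `gaussian_kde` and `count_modes` are hypothetical. It counts the peaks of the KDE curve at two bandwidths:

```python
import math

def gaussian_kde(samples, bandwidth, x):
    """Gaussian kernel density estimate at point x."""
    n = len(samples)
    total = sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples)
    return total / (n * bandwidth * math.sqrt(2 * math.pi))

def count_modes(samples, bandwidth, lo=-3.0, hi=6.0, step=0.05):
    """Count strict local maxima of the KDE on an evenly spaced grid."""
    n_points = int(round((hi - lo) / step)) + 1
    grid = [lo + i * step for i in range(n_points)]
    dens = [gaussian_kde(samples, bandwidth, x) for x in grid]
    return sum(
        1
        for i in range(1, len(dens) - 1)
        if dens[i - 1] < dens[i] > dens[i + 1]
    )

# Made-up data: two tight clusters, around 0 and around 3.
data = [-0.2, 0.0, 0.2, 2.8, 3.0, 3.2]

modes_narrow = count_modes(data, bandwidth=0.3)  # small smoothing bandwidth
modes_wide = count_modes(data, bandwidth=2.0)    # large smoothing bandwidth
print(modes_narrow, modes_wide)
```

The same six points look bimodal with the narrow bandwidth and unimodal with the wide one, which is the same kind of ambiguity a histogram has with bin width.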

You made an excellent analogy of checking whether a regression summary is likely to be accurate by using a scatter plot to check for outliers. Similarly, for this case I find it's best to check whether binning will be accurate by using a CDF plot.

Like a scatter plot, a CDF has the advantage of being “full resolution”, with no rounding or binning, and it shows ALL the data points, so it shows the texture of your underlying data much better. If it’s generally smooth over a range, then it’s “safe” to generate a histogram/KDE of the data in that range. But if it has sudden jumps, then that’s where you need to be aware that different bin widths / bandwidths may tell different stories.
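As a rough illustration of what I mean by “full resolution”, here is a minimal sketch of an empirical CDF in plain Python. The function name `ecdf` and the sample values are made up for the example:

```python
def ecdf(samples):
    """Return (xs, ps): the sorted samples and, for each position i,
    the fraction of samples at sorted positions 0..i.

    Every data point appears in the output; there is no bin width or
    bandwidth parameter. For tied values, the last entry of the tie
    holds the CDF value at that x (the ties form one vertical jump).
    """
    xs = sorted(samples)
    n = len(xs)
    ps = [(i + 1) / n for i in range(n)]
    return xs, ps

data = [2.1, 0.4, 3.3, 2.1, 1.7]  # made-up sample
xs, ps = ecdf(data)
for x, p in zip(xs, ps):
    print(f"{x:4.1f}  {p:.2f}")
```

Plotting `ps` against `xs` as a step line gives the CDF plot; the only inputs are the data themselves.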

author

Thanks for highlighting this, David.

My recommendation for a KDE was mostly for understanding the *distribution*. Typically, I see many folks using a histogram to understand the data distribution, which, as discussed here, can be misleading. In such cases, I have often found KDE to be immensely useful. Of course, there are other plots too, like Raincloud plots: https://www.blog.dailydoseofds.com/p/raincloud-plots-the-hidden-gem-of, which I often use as well.

CDF is a great option as well, no doubt, but to be honest, I have not used it that much. But I would love to know this from you:

Say I have discrete data. A CDF in this case will be generated from the PMF of the data, right? The PMF, in turn, will depend on the bin width you select. So when you create a CDF, wouldn't it be affected by the bin width as well? Please let me know if I am missing something here.

And thanks for the suggestion, it is always interesting to hear more ideas :)


I almost always use a CDF with discrete data. Even with discrete data, a CDF has no bin width. Instead, the CDF plot effectively becomes a series of stair steps, one at each discrete value.
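Here is a minimal sketch of that stair-step construction, assuming the data is just a list of discrete values. The name `discrete_cdf` and the sample are made up for illustration:

```python
from collections import Counter

def discrete_cdf(samples):
    """Return [(value, cumulative fraction)] with one step per distinct value.

    The step positions are the observed values themselves, so nothing
    here depends on a bin width.
    """
    counts = Counter(samples)
    n = len(samples)
    steps = []
    running = 0
    for value in sorted(counts):
        running += counts[value]
        steps.append((value, running / n))
    return steps

rolls = [1, 3, 3, 6, 2, 3, 6, 1]  # made-up discrete sample
steps = discrete_cdf(rolls)
for value, p in steps:
    print(value, p)
```

The height of each step is the share of the data at that exact value, so the per-value counts play the role of the PMF without any binning choice.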

This is actually very helpful because sometimes the underlying data IS discrete in some way that is not obvious from the raw units (e.g. you might be processing raw floating-point input, but the data could be a transformed version of a sensor reading that has only 8-bit resolution).
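A quick sketch of that situation, using a simulated sensor. The 8-bit range, the linear mapping, and the temperature scale are all assumptions made up for the example:

```python
import random

random.seed(0)

# Hypothetical sensor: 8-bit raw readings (0..255), mapped linearly to a
# float "temperature" between -40.0 and 125.0. Both are invented values.
raw = [random.randrange(256) for _ in range(10_000)]
temps = [-40.0 + r * (165.0 / 255.0) for r in raw]

# The floats look continuous, but they can take at most 256 distinct values,
# so the CDF of `temps` is a staircase rather than a smooth curve.
distinct = sorted(set(temps))
print(len(temps), "samples,", len(distinct), "distinct values")

# Smallest gap between neighbouring distinct values: the hidden step size.
min_gap = min(b - a for a, b in zip(distinct, distinct[1:]))
print("smallest step:", round(min_gap, 4))
```

A histogram whose bin width does not line up with that step size can show spurious empty or doubled bins, while the staircase in the CDF makes the quantization visible directly.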

Doing some searching around I found a reasonable article specifically on advantages of CDF: https://www.andata.at/en/software-blog-reader/why-we-love-the-cdf-and-do-not-like-histograms-that-much.html
