Why Correlation (and Other Summary Statistics) Can Be Misleading

...And here's how to avoid drawing misleading conclusions.

Avi Chawla

Aug 15, 2023

Many data scientists rely solely on the correlation matrix to study the association between variables.

What often goes unnoticed is that the resulting statistic can be heavily driven by a handful of outliers.

This is evident from the image above.

The addition of just two outliers drastically changed:

the correlation
the regression fit

Thus, plotting the data is highly important.

It can save you from the wrong conclusions you might otherwise draw by looking only at summary statistics.
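The effect is easy to reproduce on synthetic data. The sketch below is only an illustration, not the data behind the image: the values, the seed, the helper `pearson_and_slope`, and the two outlier points are all assumptions chosen to make the shift obvious. It compares the Pearson correlation and the least-squares slope before and after two extreme points are appended.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable

# Synthetic data (assumed for illustration): y decreases as x increases.
x = np.linspace(0, 10, 50)
y = -2.0 * x + rng.normal(scale=1.0, size=x.size)


def pearson_and_slope(x, y):
    """Return the Pearson correlation and the least-squares slope of y on x."""
    r = np.corrcoef(x, y)[0, 1]
    slope, _intercept = np.polyfit(x, y, deg=1)
    return r, slope


r_clean, slope_clean = pearson_and_slope(x, y)

# Append just two extreme points that lie far from the rest of the data.
x_out = np.append(x, [60.0, 65.0])
y_out = np.append(y, [150.0, 160.0])
r_out, slope_out = pearson_and_slope(x_out, y_out)

print(f"without outliers: r = {r_clean:.2f}, slope = {slope_clean:.2f}")
print(f"with 2 outliers:  r = {r_out:.2f}, slope = {slope_out:.2f}")
```

With these particular outliers, both the correlation and the fitted slope change sign, even though 50 of the 52 points follow a clear downward trend. Milder outliers would shift the numbers less, which is exactly why the scatter plot is worth a look.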

One thing I often do when using a correlation matrix is create a pair plot as well (shown below).

This lets me check whether the scatter plot of each pair of variables agrees with the corresponding correlation value.
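Here is a minimal sketch of that habit, assuming pandas and matplotlib are available. The DataFrame, its column names, and the two appended outlier rows are hypothetical. It uses `pandas.plotting.scatter_matrix` to draw the grid of scatter plots next to `DataFrame.corr()`.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line to open a window
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from pandas.plotting import scatter_matrix

rng = np.random.default_rng(0)

# Hypothetical dataset: b tracks a linearly, c is unrelated noise.
n = 200
a = rng.normal(size=n)
df = pd.DataFrame({
    "a": a,
    "b": 0.8 * a + rng.normal(scale=0.5, size=n),
    "c": rng.normal(size=n),
})

# Two extreme rows: they stay on the a-b trend but force a and c together.
outliers = pd.DataFrame({"a": [8.0, 9.0], "b": [6.4, 7.2], "c": [8.0, 9.0]})
df = pd.concat([df, outliers], ignore_index=True)

corr = df.corr()  # Pearson correlation is the default method
print(corr.round(2))

# One scatter plot per pair of columns, histograms on the diagonal.
axes = scatter_matrix(df, figsize=(6, 6), diagonal="hist")
plt.savefig("pair_plot.png")
```

In this setup the matrix reports a noticeable correlation between `a` and `c`, while the `a` versus `c` panel shows a shapeless cloud plus two far-away points. That mismatch is the cue to investigate before trusting the number. If seaborn is installed, `seaborn.pairplot(df)` produces a similar grid.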

👉 Over to you: What are some other measures you take when using summary statistics?


Thanks for reading!


Jon

Aug 28, 2023

How should one deal with those two outliers? Please advise.


Joe Corliss

If there are too many variables to plot individually, then Spearman's rank correlation can provide a robust measure of the association between each pair of variables.
