Why Correlation (and Other Summary…

Avi Chawla

Aug 15, 2023

...And here's how to avoid drawing misleading conclusions.


4 Comments

Jon

Aug 28, 2023

How should I deal with those two outliers? Please advise.

Avi Chawla

Aug 28, 2023

If you are sure these points really are outliers caused by genuine errors in the dataset, the best option may be to remove them altogether.

But concluding that would need a fair bit of analysis.

As you have many more columns in the dataset, I'd advise analysing these two data points specifically by looking at the other columns as well. It may be that these values look like outliers only with respect to the variable you plotted them against.

One more thing you could try is to fit a robust linear model on the entire dataset, such as:

- RANSAC Regression: https://www.blog.dailydoseofds.com/p/the-limitation-of-linear-regression

- Theil-Sen Regression: https://www.blog.dailydoseofds.com/p/theil-sen-regression-the-robust-twin

... and check whether the final model fits these data points or leaves them out as outliers.
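As a rough illustration of that check, here is a minimal sketch using scikit-learn's `RANSACRegressor` and `TheilSenRegressor`. The data is synthetic and the names `suspect_idx`, `flagged` and `residuals` are made up for this example; your own dataset, and the default RANSAC residual threshold, may behave differently.

```python
import numpy as np
from sklearn.linear_model import RANSACRegressor, TheilSenRegressor

# Synthetic data, assumed for illustration: y = 2x + 1 plus small noise.
rng = np.random.default_rng(0)
X = np.linspace(0, 10, 50).reshape(-1, 1)
y = 2.0 * X.ravel() + 1.0 + rng.normal(scale=0.2, size=50)

# Corrupt two points so they play the role of the two suspect outliers.
suspect_idx = np.array([45, 48])
y[suspect_idx] -= 30.0

# RANSAC fits on random subsets and keeps the largest consensus set of inliers.
ransac = RANSACRegressor(random_state=0).fit(X, y)
flagged = np.flatnonzero(~ransac.inlier_mask_)  # indices RANSAC left out
print("Left out by RANSAC:", flagged)
print("RANSAC slope:", ransac.estimator_.coef_[0])

# Theil-Sen exposes no inlier mask, so inspect the suspects' residuals instead.
theil_sen = TheilSenRegressor(random_state=0).fit(X, y)
residuals = y - theil_sen.predict(X)
print("Theil-Sen slope:", theil_sen.coef_[0])
print("Suspect residuals:", residuals[suspect_idx])
```

If the suspect points show up in `flagged`, or have residuals far larger than the rest under the Theil-Sen fit, that is one piece of evidence that they do not follow the trend of the remaining data. It does not by itself show that they are errors.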

There is, of course, no one-size-fits-all answer. One has to try many different things before drawing any conclusion :)

Jon

Aug 28, 2023

Thank you

Joe Corliss

Aug 15, 2023

If there are too many variables to plot individually, then Spearman's rank correlation can provide a robust measure of the association between each pair of variables.
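A minimal sketch of this idea, assuming the data sits in a pandas DataFrame: `DataFrame.corr(method="spearman")` returns the full pairwise matrix in one call. The column names and values below are invented for illustration.

```python
import pandas as pd

# Toy data, assumed for illustration: "cubic" is a monotonic but nonlinear
# function of "x", and "with_outlier" equals "x" except for one extreme value.
df = pd.DataFrame({"x": range(1, 11)})
df["cubic"] = df["x"] ** 3
df["with_outlier"] = df["x"].astype(float)
df.loc[9, "with_outlier"] = 1000.0

# Pairwise correlation matrices over every column pair.
pearson = df.corr(method="pearson")
spearman = df.corr(method="spearman")

print(pearson.round(3))
print(spearman.round(3))
```

Because Spearman works on ranks, both derived columns are perfectly rank-correlated with `x` here, while the Pearson values are pulled below 1 by the nonlinearity and by the single extreme value.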

Daily Dose of Data Science
