Correlation is often used to measure the association between two continuous variables. But it has a major flaw that often goes unnoticed.
Folks often draw conclusions from a correlation matrix without even looking at the data. However, the resulting statistics can be heavily driven by outliers or other artifacts.
This is demonstrated in the plots above. The addition of just two outliers changed the correlation and the regression line drastically.
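To try this yourself, here is a minimal sketch in Python using NumPy. It does not use the data behind the plots: the points are synthetic, and the seed, sample size, and outlier coordinates are assumptions chosen only for illustration. The helper names `pearson_r` and `fit_line` are hypothetical, not part of any library.

```python
import numpy as np


def pearson_r(x, y):
    """Pearson correlation coefficient between two 1-D arrays."""
    return float(np.corrcoef(x, y)[0, 1])


def fit_line(x, y):
    """Least-squares regression line; returns (slope, intercept)."""
    slope, intercept = np.polyfit(x, y, deg=1)
    return float(slope), float(intercept)


# Synthetic data (assumption): y is generated independently of x,
# so there is no real linear relationship between them.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = rng.normal(0, 1, size=50)

# Add just two extreme points far away from the rest of the data.
x_out = np.append(x, [30.0, 32.0])
y_out = np.append(y, [30.0, 32.0])

r_clean = pearson_r(x, y)
r_out = pearson_r(x_out, y_out)
slope_clean, _ = fit_line(x, y)
slope_out, _ = fit_line(x_out, y_out)

print(f"without outliers: r = {r_clean:.2f}, slope = {slope_clean:.2f}")
print(f"with 2 outliers:  r = {r_out:.2f}, slope = {slope_out:.2f}")
```

Running it prints the correlation and slope with and without the two extra points. A scatter plot of both versions makes the cause obvious in a way the two numbers alone do not.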
Thus, looking at the data and understanding its underlying characteristics can save you from drawing wrong conclusions. Statistics are important, but they can be highly misleading at times.
Avi, I love this post! You do a great job with communicating something important in less than a minute. I've tried to explain this to others who cite a current talking point about conclusions from data without understanding the data. I'm going to link to this instead. Thanks.