How should I deal with these 2 outliers? Please advise.
If you are sure the dataset contains outliers caused by genuine errors, the best option may be to remove them altogether.
But reaching that conclusion requires a fair bit of analysis.
Since your dataset has many more columns, I'd advise analysing these two data points specifically, taking the other columns into account as well. These values may look like outliers only with respect to the variable you plotted them against.
Another thing you could try is fitting robust linear models on the entire dataset, such as:
- RANSAC Regression: https://www.blog.dailydoseofds.com/p/the-limitation-of-linear-regression
- Theil-Sen Regression: https://www.blog.dailydoseofds.com/p/theil-sen-regression-the-robust-twin
... and see whether the final model still fits these data points or leaves them out as outliers.
There is, of course, no one-size-fits-all approach. One has to try many different things before drawing any conclusion :)
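
To make the robust-regression idea concrete, here is a minimal sketch of Theil-Sen for a single predictor, written in plain Python. The data, the helper names (`theil_sen_fit`, `flag_outliers`) and the cutoff `k=3.0` are all made up for illustration and are not from your dataset. It fits the line using the median of pairwise slopes, then flags points whose residual is far from the rest. In practice you would more likely use scikit-learn's `TheilSenRegressor` or `RANSACRegressor` (the latter exposes an `inlier_mask_` after fitting).

```python
from itertools import combinations
from statistics import median


def theil_sen_fit(x, y):
    """Return (slope, intercept) from the median of all pairwise slopes."""
    slopes = [
        (y[j] - y[i]) / (x[j] - x[i])
        for i, j in combinations(range(len(x)), 2)
        if x[j] != x[i]  # skip pairs with identical x (undefined slope)
    ]
    slope = median(slopes)
    intercept = median(yi - slope * xi for xi, yi in zip(x, y))
    return slope, intercept


def flag_outliers(x, y, slope, intercept, k=3.0):
    """Indices whose residual is more than k scaled MADs from the median residual."""
    residuals = [yi - (slope * xi + intercept) for xi, yi in zip(x, y)]
    center = median(residuals)
    mad = median(abs(r - center) for r in residuals)
    scale = 1.4826 * mad  # makes the MAD comparable to a standard deviation
    return [i for i, r in enumerate(residuals) if abs(r - center) > k * scale]


# Made-up data: roughly y = 2x + 1, with two corrupted points at index 3 and 7.
x = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
y = [1.1, 2.9, 5.1, 40.0, 8.9, 11.1, 12.9, -20.0, 17.1, 18.9]

slope, intercept = theil_sen_fit(x, y)
flagged = flag_outliers(x, y, slope, intercept)
print(f"slope={slope:.3f}, intercept={intercept:.3f}, flagged={flagged}")
```

If the two suspicious points end up in `flagged` while the fitted line still follows the rest of the data, that is one piece of evidence (not proof) that they do not fit the overall trend.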
Thank you
If there are too many variables to plot individually, then Spearman's rank correlation can provide a robust measure of the association between each pair of variables.
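
As a rough illustration of the Spearman suggestion, the sketch below computes the rank correlation for every pair of columns in plain Python. The column names and values are invented for the example, and the helpers (`average_ranks`, `pearson`, `spearman`) are hypothetical names. With pandas, `DataFrame.corr(method="spearman")` gives the same kind of pairwise matrix in one call.

```python
from itertools import combinations
from math import sqrt


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # mean of positions start+1 .. end+1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def pearson(a, b):
    """Pearson correlation; assumes neither input is constant."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / sqrt(var_a * var_b)


def spearman(a, b):
    """Spearman's rho: Pearson correlation of the ranks."""
    return pearson(average_ranks(a), average_ranks(b))


# Made-up columns: "y" increases with "x" but contains one extreme value.
columns = {
    "x": [1, 2, 3, 4, 5, 6],
    "y": [2, 4, 6, 8, 10, 500],
    "z": [6, 5, 4, 3, 2, 1],
}

pairwise = {
    (a, b): spearman(columns[a], columns[b])
    for a, b in combinations(columns, 2)
}
for pair, rho in pairwise.items():
    print(pair, round(rho, 3))
```

Because only the ranks are used, the extreme value in `y` does not weaken its association with `x`, which is what makes this measure robust to outliers.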