Always Validate Your Output Variable Before Using Linear Regression
A comparison of what happens if you do vs if you don't.
The effectiveness of a linear regression model largely depends on how well our data satisfies the algorithm's underlying assumptions.
Linear regression inherently assumes that the residuals (actual minus predicted values) follow a normal distribution. One common way this assumption gets violated is when your output variable is skewed.
As a result, fitting a model directly to a skewed output tends to produce skewed residuals and a poor regression fit.
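A quick way to validate the output before fitting is to look at its skewness. Here is a minimal sketch on simulated data (the variable `y` simply stands in for your own output column):

```python
import numpy as np
from scipy import stats

# Simulated, right-skewed target; replace with your own output column.
y = np.random.default_rng(0).lognormal(mean=0, sigma=1, size=1000)

# Skewness near 0 suggests a roughly symmetric output;
# a large positive value indicates a long right tail.
print("Skewness of y:", stats.skew(y))
```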
But the good thing is that it can be corrected. One common way to make the output symmetric before fitting a model is to apply a log transform.
It reduces the skewness by compressing the long tail, making the distribution look more symmetric and roughly normal.
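To make the "if you do vs if you don't" comparison concrete, here is a small sketch on simulated data (assuming NumPy, SciPy, and scikit-learn; the data-generating process is made up for illustration). It fits one model on the raw skewed target and one on its log, then compares the skewness of the residuals:

```python
import numpy as np
from scipy import stats
from sklearn.linear_model import LinearRegression

# Simulate a right-skewed target: y grows exponentially with x
rng = np.random.default_rng(0)
X = rng.uniform(0, 5, size=(500, 1))
y = np.exp(0.8 * X.ravel() + rng.normal(0, 0.5, size=500))

# Fit on the raw (skewed) target
raw_model = LinearRegression().fit(X, y)
raw_resid = y - raw_model.predict(X)

# Fit on the log-transformed (roughly symmetric) target
log_model = LinearRegression().fit(X, np.log(y))
log_resid = np.log(y) - log_model.predict(X)

# Residual skewness closer to 0 means closer to the normality assumption
print("Skewness of raw residuals:", stats.skew(raw_resid))
print("Skewness of log residuals:", stats.skew(log_resid))
```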
One thing to note is that if the output has zero or negative values, a log transform is undefined and will raise an error. In such cases, you can first apply a translation (shift) to the output so that all values become positive, and then take the log.
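As a rough sketch of the translation-then-log idea (the shift amount used here is just one convenient choice, not a prescribed value):

```python
import numpy as np

# Output containing zero and negative values (illustrative numbers)
y = np.array([-3.2, -0.5, 0.0, 1.7, 4.9, 12.3])

shift = 1 - y.min()           # makes the smallest value map to 1
y_log = np.log(y + shift)     # log of the translated output

# To bring predictions back to the original scale, invert both steps:
# y_pred = np.exp(y_log_pred) - shift
```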
What are some other ways that you use to address this?
👉 Read what others are saying about this post on LinkedIn: Post Link.
👉 If you liked this post, leave a heart react 🤍.
👉 If you love reading this newsletter, feel free to share it with friends!
Find the code for my tips here: GitHub.
I like to explore, experiment and write about data science concepts and tools. You can read my articles on Medium. Also, you can connect with me on LinkedIn and Twitter.
Note that the normality assumption is about the conditional distribution of the target (the response/dependent variable) given the predictors, not its unconditional distribution.